Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 5 Next »

Goals

To allow users to use the Hydrator drag and drop UI to easily create pipelines that run on Spark Streaming, leveraging built-in capabilities like windowing and machine learning.

Checklist

  • User stories documented (Albert)
  • User stories reviewed (Nitin)
  • Design documented (Albert)
  • Design reviewed (Terence/Andreas)
  • Feature merged ()
  • Examples and guides ()
  • Integration tests () 
  • Documentation for feature ()
  • Blog post

Use Cases

  1. ETL - The use cases solved using ETL realtime should also be solvable using Spark Streaming instead of a CDAP Worker.

  2. Machine Learning - An email client is set up to push an event to a Kafka topic whenever somebody uses the client to send an email. The client is also set up to push an event to another topic whenever an email is marked as spam. A pipeline developer wants to create a realtime pipeline that reads events from spam topic and trains a spam classification model in realtime using Streaming linear regression (http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression). The pipeline developer also wants to create another pipeline that reads from the email topic and adds a 'isSpam' field to each record based on the model trained by the other pipeline.
  3. Data enrichment - Every time a purchase is made on an online store, an event with purchase information is pushed to Kafka. The event contains a timestamp, purchase id, customer id, item id, and price. A pipeline developer wants to create a realtime pipeline that reads events from Kafka and joins customer information (email, age, gender, etc) to each event, then writes the events to a CDAP Table.

User Stories

Design

 

  • No labels