Goals

To allow users to use the Hydrator drag-and-drop UI to easily create pipelines that run on Spark Streaming, leveraging built-in Spark capabilities like windowing and machine learning.

Checklist

  • User stories documented (Albert)
  • User stories reviewed (Nitin)
  • Design documented (Albert)
  • Design reviewed (Terence/Andreas)
  • Feature merged ()
  • Examples and guides ()
  • Integration tests () 
  • Documentation for feature ()
  • Blog post

Use Cases

  1. ETL - A development team has some realtime Hydrator pipelines that use a CDAP Worker. They want to run their ETL pipelines on Spark Streaming because their company is standardizing on Spark.

  2. Data enrichment - Every time a purchase is made on an online store, an event with purchase information is pushed to Kafka. The event contains a timestamp, purchase id, customer id, item id, and price. A pipeline developer wants to create a realtime pipeline that reads events from Kafka, joins customer information (email, age, gender, etc.) to each event, then writes the events to a CDAP Table.

  3. Machine Learning - An email client is set up to push an event to a Kafka topic whenever somebody uses the client to send an email. The client is also set up to push an event to another topic whenever an email is marked as spam. A pipeline developer wants to create a realtime pipeline that reads events from the spam topic and trains a spam classification model in realtime using streaming linear regression (http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression). The pipeline developer also wants to create another pipeline that reads from the email topic and adds an 'isSpam' field to each record based on the model trained by the other pipeline.
  4. Windowing - Stock trade events are being pushed to Kafka. A pipeline developer wants to create a realtime pipeline that examines all trades made in various time windows and flags trades in each window whose amount differs significantly from the other trades in that window. The pipeline makes an HTTP call to some external system for each such trade event.

User Stories

  1. As a pipeline developer, I want to create realtime ETL pipelines that run using Spark Streaming.

  2. As a pipeline developer, I want to enrich streaming events by joining to other datasets.

  3. As a pipeline developer, I want to be able to group events into time windows in my streaming pipeline.

  4. As a pipeline developer, I want to train machine learning models in my streaming pipeline.
  5. As a plugin developer, I want my transform, aggregate, and join plugins to work in both the DataStreaming and DataPipeline artifacts.

  6. As a plugin developer, I want to be able to use features available in Spark Streaming, such as MLlib, to write plugins.

Design

We will introduce a new artifact similar to the DataPipeline artifact, called the DataStreaming artifact. This artifact will support the transform, sparkcompute, and joiner plugin types. In addition, we will add streamingsource, streamingsink, and window plugin types.

The streamingsource plugin type simply takes a JavaSparkContext and returns a JavaDStream:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.api.java.JavaDStream;

public interface StreamingSource<T> {

  // Create the DStream of records that this source will emit into the pipeline.
  JavaDStream<T> getStream(JavaSparkContext jsc);
}
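
The streamingsink plugin type could mirror the source API and receive the DStream to be written out. The interface below is only a hypothetical sketch of that shape; the name StreamingSink and its single write method are assumptions, not a settled API:

// Hypothetical sketch only: the streamingsink API is not defined in this document.
public interface StreamingSink<T> {

  // Persist the stream, for example by calling foreachRDD and writing each RDD to the sink's storage.
  void write(JavaDStream<T> stream);
}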

Implementations for basic sources (http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources) should mostly involve defining what is configurable and translating the relevant objects into StructuredRecords:

public class TextFileWatcherSource implements StreamingSource<StructuredRecord> {
  private final Conf conf;

  @Override
  public JavaDStream<StructuredRecord> getStream(JavaSparkContext jsc) {
    // Create a streaming context on top of the provided Spark context, with a configurable batch interval.
    JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(conf.batchSize));
    // Watch a directory for new text files and convert each (offset, line) pair into a StructuredRecord.
    return jssc.fileStream(conf.path, LongWritable.class, Text.class, TextInputFormat.class)
      .map(new TextToStructuredRecordFunction());
  }
}
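
The map function above is where the translation to StructuredRecords happens. A minimal sketch of it might look like the following, assuming a single-field 'text' output schema (the schema, field name, and record name are illustrative, not part of the design):

import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.data.schema.Schema;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

// Illustrative sketch of the conversion function used above; the output schema is an assumption.
public class TextToStructuredRecordFunction
    implements Function<Tuple2<LongWritable, Text>, StructuredRecord> {
  private static final Schema SCHEMA = Schema.recordOf(
    "textRecord", Schema.Field.of("text", Schema.of(Schema.Type.STRING)));

  @Override
  public StructuredRecord call(Tuple2<LongWritable, Text> input) {
    // Drop the file offset (the key) and keep the line contents as a single string field.
    return StructuredRecord.builder(SCHEMA)
      .set("text", input._2().toString())
      .build();
  }
}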

People can also use their own existing custom receivers (http://spark.apache.org/docs/latest/streaming-custom-receivers.html).

public class MyStreamingSource implements StreamingSource<StructuredRecord> {
  private final Conf conf;

  @Override
  public JavaDStream<StructuredRecord> getStream(JavaSparkContext jsc) {
    JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(conf.batchSize));
    // Wrap an existing custom receiver and convert its output into StructuredRecords.
    return jssc.receiverStream(new MyCustomReceiver()).map(new MyObjectToStructuredRecordFunction());
  }
}
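
For reference, the custom receiver itself is just a subclass of Spark's Receiver. The sketch below is illustrative only: it emits plain Strings for brevity (in the example above the type parameter would be whatever custom object MyObjectToStructuredRecordFunction consumes), and the polling loop is a placeholder for the real external system.

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// Illustrative sketch of a custom receiver that pushes records into Spark Streaming.
public class MyCustomReceiver extends Receiver<String> {

  public MyCustomReceiver() {
    super(StorageLevel.MEMORY_AND_DISK_2());
  }

  @Override
  public void onStart() {
    // Start a thread that reads from the external system and hands records to Spark via store().
    new Thread(new Runnable() {
      public void run() {
        while (!isStopped()) {
          store("a record read from the external system");
        }
      }
    }).start();
  }

  @Override
  public void onStop() {
    // Nothing to do; the receiving thread checks isStopped() and exits on its own.
  }
}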

 

Note to self: Investigate supporting the realtimesource plugin type by using the updateStateByKey operation (http://spark.apache.org/docs/latest/streaming-programming-guide.html#updatestatebykey-operation) to update SourceState.
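
As a rough illustration of that idea, updateStateByKey keeps per-key state across batches, which is the same shape of problem as carrying a realtimesource's SourceState forward. The snippet below is only a sketch of the mechanism, not a design: the byte[] state, the source-name key, and the sourceStates stream are all assumptions, and it assumes the Spark 1.x Java API where Optional is Guava's.

import java.util.List;
import com.google.common.base.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Sketch only: keep the most recent state seen for each source, batch over batch.
Function2<List<byte[]>, Optional<byte[]>, Optional<byte[]>> updateState =
  new Function2<List<byte[]>, Optional<byte[]>, Optional<byte[]>>() {
    @Override
    public Optional<byte[]> call(List<byte[]> newStates, Optional<byte[]> currentState) {
      // Prefer the latest state emitted in this batch, falling back to the previous state.
      return newStates.isEmpty() ? currentState : Optional.of(newStates.get(newStates.size() - 1));
    }
  };

// sourceStates is assumed to be a JavaPairDStream<String, byte[]> keyed by source name.
JavaPairDStream<String, byte[]> updated = sourceStates.updateStateByKey(updateState);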

 
