...

  1. As a pipeline developer, I want to create realtime ETL pipelines that run using Spark Streaming.

  2. As a pipeline developer, I want to be able to group events into time windows in my streaming pipeline.

  3. As a pipeline developer, I want to perform aggregations over windows of events in my pipeline.

  4. As a pipeline developer, I want to enrich streaming events by joining to other datasets.

  5. As a pipeline developer, I want to join data streams in my pipeline.

  6. As a pipeline developer, I want to train machine learning models in my streaming pipeline.

  7. As a plugin developer, I want to be able to create new streaming source and sink plugins.

  8. As a plugin developer, I want my transform, aggregator, joiner, and sparkcompute plugins to work in both Spark Streaming and Data Pipelines.

  9. As a plugin developer, I want to be able to use features available in Spark Streaming, like MLlib, to write plugins.

...

Code Block
public class MyStreamingSource {

  JavaDStream<StructuredRecord> getStream(JavaStreamingContext jssc) {
    // Create a DStream from a custom receiver, then convert each received
    // object into a StructuredRecord for the rest of the pipeline.
    return jssc.receiverStream(new MyCustomReceiver()).map(new MyObjectToStructuredRecordFunction());
  }
}

 

Note: Investigate support for the realtimesource plugin type using the updateStateByKey operation (http://spark.apache.org/docs/latest/streaming-programming-guide.html#updatestatebykey-operation) to update SourceState.

StreamingSink

The streamingsink plugin type takes a JavaDStream and a JavaSparkExecutionContext and performs some terminal operation on the stream, such as writing it to a dataset:

...

Code Block
@Override
public void write(JavaDStream<StructuredRecord> stream, final JavaSparkExecutionContext jsec) throws Exception {
  final String tableName = conf.table;
  // Convert each StructuredRecord into a Put keyed by its row key...
  stream.mapToPair(new PairFunction<StructuredRecord, byte[], Put>() {
    @Override
    public Tuple2<byte[], Put> call(StructuredRecord record) throws Exception {
      Put put = recordPutTransformer.toPut(record);
      return new Tuple2<>(put.getRow(), put);
    }
  // ...then write each micro-batch RDD to the output dataset.
  }).foreachRDD(new VoidFunction<JavaPairRDD<byte[], Put>>() {
    @Override
    public void call(JavaPairRDD<byte[], Put> putJavaPairRDD) throws Exception {
      jsec.saveAsDataset(putJavaPairRDD, tableName);
    }
  });
}

...

Unions

Multiple connections into a stage become a union of all DStreams coming into that stage.

Code Block
// Start with the first input stream and union in the rest.
Iterator<JavaDStream<Object>> previousStreamsIter = previousStreams.iterator();
JavaDStream<Object> stageStream = previousStreamsIter.next();
while (previousStreamsIter.hasNext()) {
  stageStream = stageStream.union(previousStreamsIter.next());
}

...

Transforms

A Transform stage will become a flatMap() call on its input DStream.
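The flatMap() adaptation works because a transform may emit zero, one, or many output records per input record. The wrapper logic can be sketched outside Spark with a hypothetical minimal Transform contract (the real CDAP Transform API uses an Emitter rather than a List; names here are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

public class TransformFlatMapSketch {

  // Hypothetical minimal transform contract: consume one input record,
  // emit any number of output records. The actual plugin API differs.
  interface Transform<IN, OUT> {
    void transform(IN input, List<OUT> emitter) throws Exception;
  }

  // Mirrors what a flatMap() call over the input DStream would do:
  // invoke the transform per record and flatten all emitted records.
  static <IN, OUT> List<OUT> applyAsFlatMap(Transform<IN, OUT> t, List<IN> inputs) throws Exception {
    List<OUT> out = new ArrayList<>();
    for (IN in : inputs) {
      t.transform(in, out);
    }
    return out;
  }

  public static void main(String[] args) throws Exception {
    // Example transform: drop empty strings, upper-case the rest.
    Transform<String, String> upper = (input, emitter) -> {
      if (!input.isEmpty()) {
        emitter.add(input.toUpperCase());
      }
    };
    System.out.println(applyAsFlatMap(upper, List.of("a", "", "b")));
  }
}
```

In the actual engine, the same per-record invocation would be wrapped in a FlatMapFunction and passed to stream.flatMap(), so filtering transforms and one-to-many transforms both fall out naturally.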

...