...
As a pipeline developer, I want to create real-time ETL pipelines that run using Spark Streaming.
As a pipeline developer, I want to be able to group events into time windows in my streaming pipeline.
As a pipeline developer, I want to perform aggregations over windows of events in my pipeline.
As a pipeline developer, I want to enrich streaming events by joining to other datasets.
As a pipeline developer, I want to join data streams in my pipeline.
As a pipeline developer, I want to train machine learning models in my streaming pipeline.
As a plugin developer, I want to be able to create new streaming source and sink plugins.
As a plugin developer, I want my transform, aggregator, joiner, and sparkcompute plugins to work in both Spark Streaming and Data Pipelines.
As a plugin developer, I want to be able to use features available in Spark Streaming like MLLib to write plugins.
...
Code Block

public class MyStreamingSource {

  JavaDStream<StructuredRecord> getStream(JavaStreamingContext jssc) {
    return jssc.receiverStream(new MyCustomReceiver())
      .map(new MyObjectToStructuredRecordFunction());
  }
}
Note: Investigate supporting the realtimesource plugin type by using the updateStateByKey operation (http://spark.apache.org/docs/latest/streaming-programming-guide.html#updatestatebykey-operation) to update SourceState.
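For reference, updateStateByKey folds each key's new batch values into that key's previous state, which is the mechanism the SourceState update would build on. A minimal plain-Java sketch of that contract (hypothetical names, not the Spark API; Spark's version operates on a DStream of pairs):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiFunction;

public class UpdateStateSketch {

  // Mirrors updateStateByKey's contract: (newValues, previousState) -> newState.
  static <K, V, S> Map<K, S> updateState(Map<K, List<V>> batch,
                                         Map<K, S> previous,
                                         BiFunction<List<V>, Optional<S>, S> update) {
    Map<K, S> next = new HashMap<>(previous);
    for (Map.Entry<K, List<V>> e : batch.entrySet()) {
      // Keys absent from this batch keep their old state; keys present are folded.
      next.put(e.getKey(),
               update.apply(e.getValue(), Optional.ofNullable(previous.get(e.getKey()))));
    }
    return next;
  }

  public static void main(String[] args) {
    // Running count per key, the canonical updateStateByKey example.
    Map<String, List<Integer>> batch = new HashMap<>();
    batch.put("a", Arrays.asList(1, 2));
    Map<String, Integer> state = new HashMap<>();
    state.put("a", 5);
    Map<String, Integer> next = updateState(batch, state,
        (vals, prev) -> prev.orElse(0) + vals.stream().mapToInt(Integer::intValue).sum());
    System.out.println(next.get("a")); // 5 + 1 + 2 = 8
  }
}
```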
StreamingSink
The streamingsink plugin type simply takes a JavaDStream and a context and performs some operation on it:
...
Code Block

@Override
public void write(JavaDStream<StructuredRecord> stream, final JavaSparkExecutionContext jsec) throws Exception {
  final String tableName = conf.table;
  stream.mapToPair(new PairFunction<StructuredRecord, byte[], Put>() {
    @Override
    public Tuple2<byte[], Put> call(StructuredRecord record) throws Exception {
      Put put = recordPutTransformer.toPut(record);
      return new Tuple2<>(put.getRow(), put);
    }
  }).foreachRDD(new VoidFunction<JavaPairRDD<byte[], Put>>() {
    @Override
    public void call(JavaPairRDD<byte[], Put> putJavaPairRDD) throws Exception {
      jsec.saveAsDataset(putJavaPairRDD, tableName);
    }
  });
}
...
Unions
Multiple connections into a stage become a union of all DStreams coming into that stage.
Code Block

Iterator<JavaDStream<Object>> previousStreamsIter = previousStreams.iterator();
JavaDStream<Object> stageStream = previousStreamsIter.next();
while (previousStreamsIter.hasNext()) {
  stageStream = stageStream.union(previousStreamsIter.next());
}
...
Transforms
A Transform stage will become a flatMap() call on its input DStream.
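A Transform emits zero or more output records per input record, which is why the stage compiles to flatMap() rather than map(). A minimal plain-Java sketch of that mapping, using a hypothetical Transform interface (the real CDAP API and emitter differ):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TransformFlatMapSketch {

  // Hypothetical transform: one input record may emit zero or more outputs.
  interface Transform<I, O> {
    void transform(I input, List<O> emitter);
  }

  // Stands in for dstream.flatMap(): each record expands to 0..n records,
  // and the outputs are concatenated in input order.
  static <I, O> List<O> flatMapStage(List<I> input, Transform<I, O> t) {
    List<O> out = new ArrayList<>();
    for (I record : input) {
      t.transform(record, out);
    }
    return out;
  }

  public static void main(String[] args) {
    // A transform that drops negatives and emits two records per positive input.
    Transform<Integer, Integer> t = (in, emitter) -> {
      if (in >= 0) {
        emitter.add(in);
        emitter.add(in * 10);
      }
    };
    System.out.println(flatMapStage(Arrays.asList(1, -2, 3), t)); // [1, 10, 3, 30]
  }
}
```

The same shape carries over to the real pipeline: the transform's emitter calls become the elements of the iterator returned from the flatMap function.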
...