...

  1. As a pipeline developer, I want to create realtime ETL pipelines that run using Spark Streaming.

  2. As a pipeline developer, I want to be able to group events into time windows in my streaming pipeline.

  3. As a pipeline developer, I want to perform aggregations over windows of events in my pipeline.

  4. As a pipeline developer, I want to enrich streaming events by joining to other datasets.

  5. As a pipeline developer, I want to join data streams in my pipeline.

  6. As a pipeline developer, I want to train machine learning models in my streaming pipeline.

  7. As a plugin developer, I want to be able to create new streaming source and sink plugins.

  8. As a plugin developer, I want my transform, aggregator, joiner, and sparkcompute plugins to work in both Spark Streaming and Data Pipelines.

  9. As a plugin developer, I want to be able to use features available in Spark Streaming, such as MLlib, to write plugins that train ML models.

Design

We will introduce a new artifact similar to the DataPipeline artifact, called the DataStreaming artifact. It will use the exact same configuration, except that it will support its own set of plugin types. This artifact will support the existing transform, sparkcompute, aggregator, and joiner plugin types. In addition, we will add new streamingsource, streamingsink, and streamingtransform plugin types.

Each pipeline will contain a config setting called 'batchInterval', which controls the batch duration, and therefore how much data is contained in each RDD of the discretized stream (DStream) at the source(s) of the pipeline.
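For illustration, if 'batchInterval' is expressed in seconds, it would map onto the batch duration of the underlying Spark streaming context roughly as sketched below. The unit and the exact wiring are assumptions for illustration, not finalized design:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BatchIntervalSketch {
  public static void main(String[] args) {
    // Assumed: the pipeline's 'batchInterval' setting, taken here to be in seconds.
    long batchIntervalSeconds = 10;
    SparkConf conf = new SparkConf().setAppName("DataStreamingPipeline");
    // Spark Streaming cuts each source DStream into one RDD per batch interval,
    // so this duration determines how much data each RDD contains.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(batchIntervalSeconds));
  }
}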

StreamingSource

The streamingsource plugin type simply takes a JavaSparkContext and returns a JavaDStream:

...
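As a rough sketch, the plugin interface could look like the following. The interface name StreamingSource and the method name getStream are illustrative assumptions; only the input and output types come from the description above:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.api.java.JavaDStream;

// Hypothetical sketch: a streamingsource plugin produces the DStream
// that feeds the rest of the streaming pipeline.
public interface StreamingSource<T> {
  // Given the pipeline's JavaSparkContext, return the stream of records
  // emitted by this source, batched into one RDD per 'batchInterval'.
  JavaDStream<T> getStream(JavaSparkContext context);
}

A Kafka source, for example, would implement getStream by creating a stream against its configured brokers and topics and returning the resulting JavaDStream of records.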