(3.4) A developer should be able to create pipelines that contain aggregations (GROUP BY -> count/sum/unique)
(3.5) A developer should be able to control some parts of the pipeline running before others. For example, one source -> sink branch running before another source -> sink branch.
(3.54) A developer should be able to use a Spark ML job as a pipeline stage
(3.4) A developer should be able to rerun failed pipeline runs without reconfiguring the pipeline
(3.4) A developer should be able to de-duplicate records in a pipeline
(3.5) A developer should be able to join multiple branches of a pipeline
(3.5) A developer should be able to use an Explore action as a pipeline stage
(3.5) A developer should be able to create pipelines that contain Spark Streaming jobs
(3.5) A developer should be able to create pipelines that run based on various conditions, including input data availability and Kafka events

Note: This would also satisfy user story 5, where a unique can be implemented as a Aggregation plugin, where you group by the fields you want to unique, and ignore the Iterable<> in aggregate and just emit the group key.

Story 2: Control Flow (

...

Not Reviewed, WIP)

Option 1: Introduce different types of connections. One for data flow, one for control flow

...

Story 3: Spark ML in a pipeline

Add a plugin type "sparkMLsparksink" that is treated like a transform. But instead of being a stage inside a mapper, it is a program in a workflow. The application will create a transient dataset to act as the input into the program, or an explicit source can be givensink. When present, a spark program will be used to read data, transform it, then send all transformed results to the sparksink plugin.

Code Block

{
  "stages": [
    {
      "name": "customersTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource", ...
      }
    },    
    {
      "name": "categorizer",
      "plugin": {
        "name": "SVM",
        "type": "sparkMLsparksink", ...
      }
    },
    {
      "name": "models",
      "plugin": {
        "name": "Table",
        "type": "batchsink", ...
      }
    },
  ],
  "connections": [
    { "from": "customersTable", "to": "categorizer" },
    { "from": "categorizer", "to": "models" }
  ]
}

Story 6: Join (Not Reviewed, WIP)

Add a join plugin type. Different implementations could be inner join, left outer join, etc.

...

Versions Compared

Old Version 51

New Version Current

Key

Story 2: Control Flow (

Not Reviewed, WIP)

Story 3: Spark ML in a pipeline

Story 6: Join (Not Reviewed, WIP)

Page Comparison

Versions Compared

Old Version 51

New Version Current

Key

Story 2: Control Flow (

Not Reviewed, WIP)

Story 3: Spark ML in a pipeline

Story 6: Join (Not Reviewed, WIP)