...
- Control flow in the pipeline makes it possible to perform certain validations and run different branches of the pipeline based on their outcome
  - Example 1: Decision node (see the Java sketch after this list)
    - Ingest Twitter data collected on remote machines and perform the subsequent analytics processing (aggregation) only if the number of records ingested is above 1.0M (assuming an average of 5K tweets/sec, with some tolerance)
    - Reasoning: anything less than that could indicate a data collection problem, in which case the pipeline should not proceed
    - This requires a decision point: a control node that can run one of two different branches of a pipeline
  - Example 2: Connector node
    - Collect customer data from Salesforce, MySQL, and legacy CRM systems, normalize the data, and perform the subsequent processing only if the data size is > 1M records
    - This node is similar to Oozie's join node
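
Below is a minimal Java sketch of the decision-node idea from Example 1. The `Branch` interface, class names, and the constant are illustrative only (not an existing API); the point is simply that the record count produced by the ingest stage selects which of two branches runs.

```java
// Illustrative sketch of a decision node: the record count produced by the ingest stage
// selects which of two downstream branches of the pipeline is run.
public class DecisionNodeSketch {

  /** Hypothetical handle to a branch of the pipeline that follows the decision point. */
  interface Branch {
    void run();
  }

  // 1.0M records is the lower bound from Example 1 (average ~5K tweets/sec, with tolerance).
  private static final long MIN_EXPECTED_RECORDS = 1_000_000L;

  /** Runs the analytics branch only when ingestion looks healthy; otherwise the abort branch. */
  static void decide(long ingestedRecords, Branch analyticsBranch, Branch abortBranch) {
    if (ingestedRecords > MIN_EXPECTED_RECORDS) {
      analyticsBranch.run();   // true branch: aggregation / analytics
    } else {
      abortBranch.run();       // false branch: likely a data collection problem
    }
  }

  public static void main(String[] args) {
    decide(2_500_000L,
        () -> System.out.println("Proceeding with aggregation"),
        () -> System.out.println("Skipping analytics: possible data collection problem"));
  }
}
```
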
- Example 1: Decision node
...
Requirements
- A new plugin type with the capability to run a command, plus a few plugins with the following capabilities (see the sketch below)
  - Capability to specify a condition based on
    - The return status of the command that is run
    - Workflow tokens
  - Capability to specify two different paths in the data pipeline based on the outcome of the condition
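
A minimal sketch of what these capabilities might look like, assuming a hypothetical `Condition` contract (none of the names below come from an existing plugin API): one implementation bases the condition on the return status of a command that is run, the other on a workflow token value, and the boolean result is what would select between the two paths.

```java
import java.io.IOException;
import java.util.Map;

// Illustrative sketch only: a hypothetical contract for condition plugins whose boolean
// result selects which of two pipeline paths is executed.
public class ConditionPluginSketch {

  /** Hypothetical condition contract: true routes the run down one path, false down the other. */
  interface Condition {
    boolean apply(Map<String, String> workflowTokens) throws Exception;
  }

  /** Condition based on the return status of the command that is run. */
  static class CommandCondition implements Condition {
    private final String[] command;

    CommandCondition(String... command) {
      this.command = command;
    }

    @Override
    public boolean apply(Map<String, String> workflowTokens) throws IOException, InterruptedException {
      int exitCode = new ProcessBuilder(command).inheritIO().start().waitFor();
      return exitCode == 0;   // exit code 0 takes the "true" path
    }
  }

  /** Condition based on a workflow token, e.g. a record count published by an upstream stage. */
  static class TokenThresholdCondition implements Condition {
    private final String tokenKey;
    private final long threshold;

    TokenThresholdCondition(String tokenKey, long threshold) {
      this.tokenKey = tokenKey;
      this.threshold = threshold;
    }

    @Override
    public boolean apply(Map<String, String> workflowTokens) {
      long value = Long.parseLong(workflowTokens.getOrDefault(tokenKey, "0"));
      return value > threshold;
    }
  }
}
```

The two variants are kept separate only to mirror the two bullets above; a single plugin could support both condition sources.
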
Use case
- A data pipeline processes Twitter mentions data on an hourly basis. If fewer than 1000 records are ingested in an hour, that could indicate a problem with data collection, and in that case the rest of the pipeline, which parses the data and computes analytics, should not be executed (sketched below)
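
A self-contained sketch of this gate, assuming the hourly ingest stage publishes its record count under a workflow token named `ingest.record.count` (the token name and class are illustrative, not part of the requirements above):

```java
import java.util.Map;

// Illustrative sketch of the hourly Twitter-mentions gate: if fewer than 1000 records were
// ingested in the hour, the parse and analytics stages are skipped.
public class TwitterMentionsGate {

  private static final long MIN_RECORDS_PER_HOUR = 1000L;

  /** Returns true when the downstream parse and analytics stages should run. */
  static boolean shouldRunAnalytics(Map<String, String> workflowTokens) {
    long ingested = Long.parseLong(workflowTokens.getOrDefault("ingest.record.count", "0"));
    return ingested >= MIN_RECORDS_PER_HOUR;
  }

  public static void main(String[] args) {
    // Token values as they might look after an unhealthy hour of collection.
    Map<String, String> tokens = Map.of("ingest.record.count", "742");

    if (shouldRunAnalytics(tokens)) {
      System.out.println("Run the parse and analytics stages");
    } else {
      System.out.println("Skip the rest of the pipeline: possible data collection problem");
    }
  }
}
```
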