...
- Control flow in the pipeline makes it possible to perform certain validations and run different branches of the pipeline based on their outcome
  - Example 1: Decision node (see the Java sketch after this list)
    - Ingest Twitter data collected on remote machines and perform the subsequent analytics processing (aggregation) only if the number of records ingested is above 1.0M (assuming an average of 5K tweets/sec, with some tolerance)
    - Reasoning: anything less than that could indicate a data collection problem, in which case the pipeline should not proceed
    - This requires a decision point: a control node that can run one of two different branches of a pipeline
  - Example 2: Connector node
    - Collect customer data from Salesforce, MySQL, and legacy CRM systems, normalize the data, and perform the subsequent processing only if the data size is > 1M records
    - This node is similar to Oozie's join node
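
Below is a minimal Java sketch of the decision-node idea from Example 1. The `Branch` interface, class names, and the constant are illustrative only (not an existing API); the point is simply that the record count produced by the ingest stage selects which of two branches runs.

```java
// Illustrative sketch of a decision node: the record count produced by the ingest stage
// selects which of two downstream branches of the pipeline is run.
public class DecisionNodeSketch {

  /** Hypothetical handle to a branch of the pipeline that follows the decision point. */
  interface Branch {
    void run();
  }

  // 1.0M records is the lower bound from Example 1 (average ~5K tweets/sec, with tolerance).
  private static final long MIN_EXPECTED_RECORDS = 1_000_000L;

  /** Runs the analytics branch only when ingestion looks healthy; otherwise the abort branch. */
  static void decide(long ingestedRecords, Branch analyticsBranch, Branch abortBranch) {
    if (ingestedRecords > MIN_EXPECTED_RECORDS) {
      analyticsBranch.run();   // true branch: aggregation / analytics
    } else {
      abortBranch.run();       // false branch: likely a data collection problem
    }
  }

  public static void main(String[] args) {
    decide(2_500_000L,
        () -> System.out.println("Proceeding with aggregation"),
        () -> System.out.println("Skipping analytics: possible data collection problem"));
  }
}
```
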
- Example 1: Decision node
...
Requirements
- A new plugin type with the capability to run a command, plus a few plugins with the following capabilities (see the sketch below)
  - Capability to specify a condition based on
    - The return status of the command that is run
    - Workflow tokens
  - Capability to specify two different paths in the data pipeline based on the outcome of the condition
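
A minimal sketch of what these capabilities might look like, assuming a hypothetical `Condition` contract (none of the names below come from an existing plugin API): one implementation bases the condition on the return status of a command that is run, the other on a workflow token value, and the boolean result is what would select between the two paths.

```java
import java.io.IOException;
import java.util.Map;

// Illustrative sketch only: a hypothetical contract for condition plugins whose boolean
// result selects which of two pipeline paths is executed.
public class ConditionPluginSketch {

  /** Hypothetical condition contract: true routes the run down one path, false down the other. */
  interface Condition {
    boolean apply(Map<String, String> workflowTokens) throws Exception;
  }

  /** Condition based on the return status of the command that is run. */
  static class CommandCondition implements Condition {
    private final String[] command;

    CommandCondition(String... command) {
      this.command = command;
    }

    @Override
    public boolean apply(Map<String, String> workflowTokens) throws IOException, InterruptedException {
      int exitCode = new ProcessBuilder(command).inheritIO().start().waitFor();
      return exitCode == 0;   // exit code 0 takes the "true" path
    }
  }

  /** Condition based on a workflow token, e.g. a record count published by an upstream stage. */
  static class TokenThresholdCondition implements Condition {
    private final String tokenKey;
    private final long threshold;

    TokenThresholdCondition(String tokenKey, long threshold) {
      this.tokenKey = tokenKey;
      this.threshold = threshold;
    }

    @Override
    public boolean apply(Map<String, String> workflowTokens) {
      long value = Long.parseLong(workflowTokens.getOrDefault(tokenKey, "0"));
      return value > threshold;
    }
  }
}
```

The two variants are kept separate only to mirror the two bullets above; a single plugin could support both condition sources.
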
Use case
- A data pipeline processes Twitter mentions data on an hourly basis. If fewer than 1000 records are ingested in an hour, that could indicate a problem with data collection, and in that case the rest of the pipeline, which parses the data and computes analytics, should not be executed (sketched below)
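
A self-contained sketch of this gate, assuming the hourly ingest stage publishes its record count under a workflow token named `ingest.record.count` (the token name and class are illustrative, not part of the requirements above):

```java
import java.util.Map;

// Illustrative sketch of the hourly Twitter-mentions gate: if fewer than 1000 records were
// ingested in the hour, the parse and analytics stages are skipped.
public class TwitterMentionsGate {

  private static final long MIN_RECORDS_PER_HOUR = 1000L;

  /** Returns true when the downstream parse and analytics stages should run. */
  static boolean shouldRunAnalytics(Map<String, String> workflowTokens) {
    long ingested = Long.parseLong(workflowTokens.getOrDefault("ingest.record.count", "0"));
    return ingested >= MIN_RECORDS_PER_HOUR;
  }

  public static void main(String[] args) {
    // Token values as they might look after an unhealthy hour of collection.
    Map<String, String> tokens = Map.of("ingest.record.count", "742");

    if (shouldRunAnalytics(tokens)) {
      System.out.println("Run the parse and analytics stages");
    } else {
      System.out.println("Skip the rest of the pipeline: possible data collection problem");
    }
  }
}
```
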