About
Documenting thoughts on Control flow support in Hydrator
Status
Draft
What is this feature
Hydrator pipeline has two kinds of nodes:
- Action nodes: represent control flow
- Source, Transform, Sink, Spark Compute, Aggregation nodes: represent data flow
As of 3.5, Hydrator pipelines can contain both control flow and data flow; however, control flow nodes can appear only before the source or after the sink.
We need the ability to add control flow anywhere in a Hydrator pipeline.
How does this feature help our customers
- Control flow in the pipeline enables validations and conditional execution of pipeline branches
- Example1: Decision node
- Ingest Twitter data collected on remote machines and perform subsequent analytics processing (aggregation) only if the number of records ingested is above 1.0M (the average is 5K tweets/sec, with some tolerance).
- Reasoning: anything less could indicate a data collection problem, and the pipeline should not proceed
- This requires a decision point: a control node that can run one of two different branches in a pipeline
- Example 2: Connector node
- Collect customer data from Salesforce, MySQL, and legacy CRM systems, normalize the data, and perform subsequent processing only if the data size is > 1M records
- This node is similar to an Oozie join node
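The decision-node condition in Example 1 can be sketched as plain Java. This is a minimal sketch only: the class name, the `shouldProceed` method, and the `"source.records.out"` token key are assumptions for illustration, not the actual Hydrator plugin API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the Example 1 decision: proceed with analytics
// only when the ingested record count crosses a threshold.
public class RecordCountCondition {

    // Minimum ingested records required to continue the pipeline
    // (1.0M, per the Twitter example above).
    private static final long THRESHOLD = 1_000_000L;

    // Returns true when the analytics branch should run, false when the
    // pipeline should take the "data collection problem" branch instead.
    public static boolean shouldProceed(Map<String, Long> workflowTokens) {
        long ingested = workflowTokens.getOrDefault("source.records.out", 0L);
        return ingested >= THRESHOLD;
    }

    public static void main(String[] args) {
        Map<String, Long> tokens = new HashMap<>();
        tokens.put("source.records.out", 1_200_000L);
        System.out.println(shouldProceed(tokens)); // prints true

        tokens.put("source.records.out", 300_000L);
        System.out.println(shouldProceed(tokens)); // prints false
    }
}
```

In a real plugin, the record count would come from workflow tokens emitted by the source stage rather than a hand-built map.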
Requirements
- New plugin type and a few plugins with the following capabilities
- Capabilities to specify a condition based on:
  - Return status of a command that is run
  - Workflow tokens
- Capabilities to specify two different paths in the data pipeline based on the outcome
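The first requirement, a condition driven by the return status of a command, can be sketched as follows. The class and method names are assumptions for illustration; a real plugin would wire this into the pipeline's condition API.

```java
// Hypothetical sketch of a condition based on the exit status of a command.
public class CommandCondition {

    // Runs the given command and returns true only when it exits with
    // status 0, mirroring the "return status of the command" requirement.
    public static boolean commandSucceeded(String... command) throws Exception {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor() == 0;
    }

    public static void main(String[] args) throws Exception {
        // "true" and "false" are standard Unix commands that exit
        // with status 0 and 1 respectively.
        System.out.println(commandSucceeded("true"));  // prints true
        System.out.println(commandSucceeded("false")); // prints false
    }
}
```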
Use Case
- A data pipeline processes Twitter mentions data on an hourly basis. If fewer than 1,000 records are ingested in an hour, that could indicate a problem with data collection; in that case, the rest of the pipeline, which parses the data and computes analytics, should not be executed
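This use case also needs the second requirement, choosing between two pipeline paths based on the condition's outcome. A minimal sketch, where the branch names "parse-and-analyze" and "stop-and-alert" are purely illustrative:

```java
// Hypothetical sketch mapping the hourly record-count check to one of
// two pipeline branches; branch names are illustrative only.
public class HourlyMentionsDecision {

    // Per the use case: fewer than 1,000 records/hour suggests a
    // data collection problem.
    private static final long MIN_RECORDS_PER_HOUR = 1_000L;

    // Picks which pipeline branch to run for this hour's ingest count.
    public static String chooseBranch(long recordsThisHour) {
        return recordsThisHour < MIN_RECORDS_PER_HOUR
                ? "stop-and-alert"      // likely data-collection problem
                : "parse-and-analyze";  // normal analytics branch
    }

    public static void main(String[] args) {
        System.out.println(chooseBranch(250));    // prints stop-and-alert
        System.out.println(chooseBranch(52_000)); // prints parse-and-analyze
    }
}
```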