Control Flow support in Hydrator

Created by Sree Raman on Sept 17, 2016

About

Documenting thoughts on Control flow support in Hydrator

Status

Draft

What is this feature

A Hydrator pipeline has two kinds of nodes:

- Action nodes, which represent control flow
- Source, Transform, Sink, Spark Compute, and Aggregation nodes, which represent data flow

As of 3.5, Hydrator pipelines can have both control flow and data flow; however, control flow can be present only before the source or after the sink.

We need the ability to add control flow anywhere in a Hydrator pipeline.

How does this feature help our customers

Having control flow inside the pipeline makes it possible to perform validations and to run only selected branches of the pipeline.
- Example 1: Decision node
  - Ingest Twitter data collected on remote machines and perform subsequent analytics processing (aggregation) only if the number of records ingested is above 1.0M (assuming an average of 5K tweets/sec, with some tolerance).
  - Reasoning: Anything less than that could mean there is a data collection problem, and the pipeline should not proceed.
  - This needs a decision point: a control node that chooses between two different branches of the pipeline.
- Example 2: Connector node
  - Collect customer data from Salesforce, MySQL, and legacy CRM systems, normalize the data, and perform subsequent processing only if the data size is > 1M records.
  - This node is similar to the Oozie join node.
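
The two examples above can be sketched in a few lines of Python. This is only an illustration of the intended behavior, not Hydrator code: the names `decide`, `Connector`, `report`, and the branch labels `"process"` and `"stop"` are all hypothetical, and the connector is assumed to sum the record counts of its upstream branches before applying the threshold.

```python
from typing import Dict, Optional, Set

# Threshold taken from the examples above (1M records).
MIN_RECORDS = 1_000_000


def decide(record_count: int, threshold: int = MIN_RECORDS) -> str:
    """Decision node: pick the branch to run based on the record count."""
    # "process" and "stop" are hypothetical branch labels.
    return "process" if record_count > threshold else "stop"


class Connector:
    """Connector node: wait for every upstream branch, then decide once."""

    def __init__(self, expected: Set[str], threshold: int = MIN_RECORDS):
        self.expected = set(expected)
        self.threshold = threshold
        self.counts: Dict[str, int] = {}

    def report(self, source: str, record_count: int) -> Optional[str]:
        """Record one branch's result; return a branch label once all arrived."""
        if source not in self.expected:
            raise ValueError(f"unexpected source: {source}")
        self.counts[source] = record_count
        if set(self.counts) != self.expected:
            return None  # still waiting on at least one upstream branch
        # Assumption: the condition applies to the combined record count.
        return decide(sum(self.counts.values()), self.threshold)
```

Returning `None` until every expected source has reported mirrors the join semantics mentioned above: downstream processing is not chosen until all upstream branches have finished.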

What is needed

A new plugin type with:
- Capabilities to run a command
- Capabilities to specify a condition based on
  - Return status of the command
  - Workflow tokens
- Capabilities to specify two different paths in the data pipeline based on the outcome
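
A minimal sketch of the requirements above, assuming workflow tokens can be modeled as a plain string-to-string mapping. It does not use any Hydrator or CDAP API: `choose_path`, `enough_records`, the token key `records.ingested`, and the path names are hypothetical and exist only to show how a command's return status and the tokens could jointly select one of two paths.

```python
import subprocess
from typing import Callable, Dict, Sequence

# Assumption: workflow tokens are modeled as a simple key/value mapping.
Tokens = Dict[str, str]
# A condition sees the command's return status and the tokens.
Condition = Callable[[int, Tokens], bool]


def choose_path(
    command: Sequence[str],
    tokens: Tokens,
    condition: Condition,
    if_true: str,
    if_false: str,
) -> str:
    """Run the command, evaluate the condition, and return the path to follow."""
    # check=False: a non-zero exit is an input to the condition, not an error.
    status = subprocess.run(list(command), check=False).returncode
    return if_true if condition(status, tokens) else if_false


def enough_records(status: int, tokens: Tokens) -> bool:
    """Hypothetical condition: command succeeded and more than 1M records ingested."""
    ingested = int(tokens.get("records.ingested", "0"))
    return status == 0 and ingested > 1_000_000
```

Keeping the condition as a separate callable means the same node could branch on the return status alone, on tokens alone, or on both, which matches the three capabilities listed above.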
