About

Documenting thoughts on Control flow support in Hydrator

Status

Draft

What is this feature

A Hydrator pipeline has two kinds of nodes:

  • Action nodes, which represent control flow
  • Source, Transform, Sink, Spark Compute, and Aggregation nodes, which represent data flow

As of 3.5, Hydrator pipelines can contain both control flow and data flow. However, control flow can appear only before the source or after the sink.

We need the capability to add control flow anywhere in a Hydrator pipeline.

How does this feature help our customers 
  • Control flow in the pipeline makes it possible to run validations and to choose which branches of the pipeline run
    • Example 1: Decision node
      • Ingest Twitter data collected on remote machines and perform subsequent analytics processing (aggregation) only if the number of records ingested is above 1.0M (assuming an average of 5K tweets/sec, with some tolerance).
      • Reasoning: Anything less than that could indicate a data collection problem, and the pipeline should not proceed.
      • This needs a decision point: a control node that can run either of two different branches in a pipeline.
    • Example 2: Connector node
      • Collect customer data from Salesforce, MySQL, and legacy CRM systems, normalize the data, and perform subsequent processing only if the data size is > 1M records.
      • This node is similar to the Oozie join node.
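
The two examples above can be sketched roughly as follows. This is only an illustration of the intended behavior, not Hydrator code: the function names, the branch names ("aggregate", "stop"), and the use of plain record counts are all assumptions made for the sketch.

```python
from typing import Dict

# Threshold from the examples above: proceed only above 1M records.
MIN_RECORDS = 1_000_000


def decide_branch(records_ingested: int, threshold: int = MIN_RECORDS) -> str:
    """Decision node (hypothetical): pick which branch of the pipeline runs.

    Returns "aggregate" when enough records were ingested, otherwise "stop".
    Both branch names are made up for this sketch.
    """
    return "aggregate" if records_ingested > threshold else "stop"


def join_counts(source_counts: Dict[str, int]) -> int:
    """Connector node (hypothetical): combine the results of all upstream
    branches, here simply by summing their record counts."""
    return sum(source_counts.values())


# Example 2: three sources are joined, then the decision is applied once.
total = join_counts({"salesforce": 600_000, "mysql": 300_000, "legacy_crm": 200_000})
next_branch = decide_branch(total)
```

In this sketch the connector step runs only after every upstream source has reported a count, which is the sense in which it resembles a join.
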
What is needed 
  • A new plugin type with
    • The capability to run a command
    • The capability to specify a condition based on
      • Return status of the command
      • Workflow tokens
    • The capability to specify two different paths in the data pipeline based on the outcome
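
One possible shape for such a plugin is sketched below, purely as an assumption-laden illustration. The names run_condition_node, enough_records, the token key "records.ingested", and the path names "true_path" and "false_path" are hypothetical, and workflow tokens are modeled as a plain string dictionary rather than any real Hydrator type.

```python
import subprocess
import sys
from typing import Callable, Dict, List

# Assumption: workflow tokens are modeled as a simple string-to-string map.
Tokens = Dict[str, str]
Predicate = Callable[[int, Tokens], bool]


def run_condition_node(command: List[str], tokens: Tokens, predicate: Predicate) -> str:
    """Run a command, evaluate the condition, and return the path to follow."""
    # check=False: a non-zero exit is an input to the condition, not an error.
    completed = subprocess.run(command, check=False)
    return "true_path" if predicate(completed.returncode, tokens) else "false_path"


def enough_records(status: int, tokens: Tokens) -> bool:
    """Example condition: the command succeeded and more than 1M records
    were ingested (hypothetical token key)."""
    return status == 0 and int(tokens.get("records.ingested", "0")) > 1_000_000


# Stand-in commands for the sketch: they only exit with a given status.
OK_COMMAND = [sys.executable, "-c", "raise SystemExit(0)"]
FAILING_COMMAND = [sys.executable, "-c", "raise SystemExit(3)"]

path = run_condition_node(OK_COMMAND, {"records.ingested": "1500000"}, enough_records)
```

Keeping the condition as a separate function means the same node can test the return status, the tokens, or both.
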

 

 
