About

Documenting thoughts on Control flow support in Hydrator

Status

Draft

What is this feature?

A Hydrator pipeline has two kinds of nodes:

  • Action nodes: represent control flow
  • Source, Transform, Sink, Spark Compute, and Aggregation nodes: represent data flow

As of 3.5, Hydrator pipelines can have both control and data flow; however, control flow can appear only before the source or after the sink.

We need the capability to add control flow anywhere in a Hydrator pipeline.

How does this feature help our customers?
  • Control flow in a pipeline makes it possible to perform validations and conditionally run branches of the pipeline.
    • Example 1: Decision node
      • Ingest Twitter data collected on remote machines and perform subsequent analytics processing (aggregation) only if the number of records ingested is above 1.0M (given an average of 5K tweets/sec with some tolerance).
      • Reasoning: anything less than that could mean there is a data collection problem, and the pipeline should not proceed.
      • This requires a decision point: a control node that can run one of two different branches of the pipeline (see the sketch after this list).
    • Example 2: Connector node
      • Collect customer data from Salesforce, MySQL, and legacy CRM systems; normalize the data; and perform subsequent processing only if the data size is greater than 1M records.
      • This node is similar to the Oozie join node.
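
To make the decision-node idea concrete, here is a minimal sketch in Java of what such a node does: evaluate a condition and run exactly one of two branches. This is purely illustrative; DecisionSketch, runDecision, and the example values are assumptions, not Hydrator code.

import java.util.function.BooleanSupplier;

// Illustrative sketch only: a decision node evaluates a condition and then
// runs exactly one of two branches of the pipeline.
public class DecisionSketch {
  static void runDecision(BooleanSupplier condition,
                          Runnable trueBranch, Runnable falseBranch) {
    if (condition.getAsBoolean()) {
      trueBranch.run();   // e.g. the aggregation/analytics branch
    } else {
      falseBranch.run();  // e.g. an alerting branch, or stop the pipeline
    }
  }

  public static void main(String[] args) {
    long recordsIngested = 1_200_000L;  // assumed value for illustration
    runDecision(() -> recordsIngested > 1_000_000L,
                () -> System.out.println("proceed with analytics"),
                () -> System.out.println("skip: possible data collection problem"));
  }
}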
Requirements
  • A new plugin type and a few plugins with the following capabilities (a sketch of a possible shape follows this list):
    • The ability to specify a condition based on:
      • The return status of the command that is run
      • Workflow tokens
    • The ability to specify two different paths in the data pipeline based on the outcome
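
A minimal sketch, in Java, of what the new plugin type could look like. All names here (Condition, ConditionContext, getCommandExitStatus, getTokenValues) are illustrative assumptions for this draft, not confirmed Hydrator APIs.

import java.util.Map;

/**
 * Illustrative shape for the proposed condition plugin type: the pipeline
 * would follow one branch if apply() returns true and the other if false.
 */
public interface Condition {
  boolean apply(ConditionContext context) throws Exception;
}

/**
 * Context a condition could read from, covering the two capabilities listed
 * above: the return status of a command, and workflow tokens.
 */
interface ConditionContext {
  // Exit status of the command run by a preceding action node.
  int getCommandExitStatus();

  // Workflow token values (e.g. record counts) set by earlier stages.
  Map<String, String> getTokenValues();
}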
Use Case
  • A data pipeline processes Twitter mentions on an hourly basis. If the number of records ingested is less than 1000 per hour, it could indicate a problem with data collection; in that case, the rest of the pipeline, which parses the data and computes analytics, should not be executed. A sketch of this condition follows.
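
Expressed against the hypothetical Condition interface sketched above, the use case could look like the following. The token key "source.record.count" is an assumed name for illustration only.

/**
 * Sketch of the use case: proceed with the parse/analytics branch only if at
 * least 1000 records were ingested in the hour. Not a real Hydrator plugin.
 */
public class MinimumRecordCountCondition implements Condition {
  private static final long MIN_RECORDS_PER_HOUR = 1000L;

  @Override
  public boolean apply(ConditionContext context) {
    // "source.record.count" is an assumed token key written by the source stage.
    String count = context.getTokenValues().get("source.record.count");
    long ingested = (count == null) ? 0L : Long.parseLong(count);
    // Fewer than 1000 records per hour likely means a data collection problem,
    // so the downstream parse/analytics branch should not run.
    return ingested >= MIN_RECORDS_PER_HOUR;
  }
}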