...

  • Control flow in a pipeline makes it possible to perform validations and to choose which branches of the pipeline to run
    • Example 1: Decision node
      • Ingest Twitter data collected on remote machines and perform subsequent analytics processing (aggregation) only if the number of records ingested is above 1.0M (assuming an average of 5K tweets/sec, with some tolerance).
      • Reasoning: Anything less than that could mean there is a data collection problem, and the pipeline should not proceed.
      • This needs a decision point: a control node that can choose between two different branches in a pipeline.
    • Example 2: Connector node
      • Collect customer data from Salesforce, MySQL, and legacy CRM systems, normalize the data, and perform subsequent processing only if the data size is > 1M records.
      • This node is similar to the Oozie join node.
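The two examples above can be sketched roughly as follows. This is only an illustration, not the proposed plugin API: the function names, the "proceed"/"stop" branch labels, and the shape of the per-source counts are all assumptions made for this sketch.

```python
from typing import Dict, Iterable

# Hypothetical branch labels; not part of any existing pipeline API.
PROCEED = "proceed"
STOP = "stop"


def decision_node(record_count: int, threshold: int) -> str:
    """Pick a branch: continue only when the count is above the threshold."""
    return PROCEED if record_count > threshold else STOP


def connector_node(
    counts_by_source: Dict[str, int],
    expected_sources: Iterable[str],
    threshold: int,
) -> str:
    """Join step: require every source to have reported, then decide on the total."""
    sources = list(expected_sources)
    missing = [s for s in sources if s not in counts_by_source]
    if missing:
        raise ValueError(f"sources have not reported yet: {missing}")
    total = sum(counts_by_source[s] for s in sources)
    return decision_node(total, threshold)
```

For Example 1, `decision_node(1_200_000, 1_000_000)` selects the "proceed" branch. For Example 2, the connector first checks that Salesforce, MySQL, and the legacy CRM have all delivered counts, and only then applies the same size check to the combined total.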

...

Requirements
  • A new plugin type with the capability to run a command, and a few plugins with the following capabilities:
    • Capability to specify a condition based on:
      • Return status of the command that is run
      • Workflow tokens 
    • Capability to specify two different paths in the data pipeline based on the outcome
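One way to picture these requirements is the sketch below. It assumes workflow tokens can be read as a plain string-to-string mapping and that a command is considered successful when its exit status is 0; the helper names `command_succeeded`, `token_above`, and `route` are hypothetical and do not come from any existing plugin.

```python
import subprocess
import sys
from typing import Dict, List


def command_succeeded(command: List[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(command).returncode == 0


def token_above(tokens: Dict[str, str], key: str, minimum: int) -> bool:
    """Condition on a workflow token (assumed here to hold a numeric string)."""
    return int(tokens.get(key, "0")) > minimum


def route(condition: bool, on_true: str, on_false: str) -> str:
    """Select one of two paths in the pipeline based on the condition outcome."""
    return on_true if condition else on_false


if __name__ == "__main__":
    # Condition based on the return status of a command.
    ok = command_succeeded([sys.executable, "-c", "raise SystemExit(0)"])
    print(route(ok, "run-analytics", "send-alert"))

    # Condition based on a workflow token written by an earlier stage.
    tokens = {"records.ingested": "1500"}
    print(route(token_above(tokens, "records.ingested", 1000), "run-analytics", "send-alert"))
```

Keeping the condition separate from the routing step means the same two-path mechanism can serve both condition sources listed above.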
Use Case
  • A data pipeline processes Twitter mentions data on an hourly basis. If the number of records ingested is less than 1000 per hour, this could indicate a problem with data collection; in that case, the rest of the pipeline, which parses the data and computes analytics, should not be executed.
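As a rough illustration of this use case, the sketch below runs the downstream stages only when the hourly record count reaches the threshold. The function name and the representation of stages as plain callables are assumptions for this example; only the 1000-records-per-hour threshold comes from the text.

```python
from typing import Callable, List

MIN_RECORDS_PER_HOUR = 1000  # threshold stated in the use case


def run_hourly_pipeline(
    records_ingested: int,
    downstream_stages: List[Callable[[], None]],
) -> bool:
    """Run the remaining stages unless too few records were ingested.

    Returns True when the downstream stages ran, False when they were skipped.
    """
    if records_ingested < MIN_RECORDS_PER_HOUR:
        # Likely a data collection problem: do not parse or compute analytics.
        return False
    for stage in downstream_stages:
        stage()
    return True


if __name__ == "__main__":
    stages = [lambda: print("parse mentions"), lambda: print("compute analytics")]
    print(run_hourly_pipeline(250, stages))   # stages skipped
    print(run_hourly_pipeline(4200, stages))  # stages run in order
```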