Goals
Checklist
...
1.) For each stage, runtime arguments can be passed from the Hydrator UI. Since a Hydrator pipeline can have multiple phases, instead of using runtime arguments from CDAP we can use preferences to store Hydrator runtime arguments. Preferences for the Hydrator app can be set using the following CDAP REST endpoint.
PUT <base-url>/namespaces/<namespace>/apps/<app-id>/preferences
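A minimal sketch of how a client could target that endpoint (the base URL, namespace, app id, and preference values below are hypothetical examples; the actual call is an HTTP PUT with a JSON body of preference key/values):

```javascript
// Sketch: building the PUT request for Hydrator app preferences.
// The base URL, namespace, and app id are hypothetical examples.
function preferencesUrl(baseUrl, namespace, appId) {
  return `${baseUrl}/namespaces/${namespace}/apps/${appId}/preferences`;
}

const url = preferencesUrl('http://localhost:11015/v3', 'default', 'myPipeline');
const body = JSON.stringify({ 'password': 'a', 'plugin1.password': 'b' });
// The actual call would then be an HTTP PUT, e.g.:
//   fetch(url, { method: 'PUT', headers: { 'Content-Type': 'application/json' }, body });
console.log(url);
```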
2.) The Hydrator app will substitute properties using macro substitution for each ETLStage. For the substitution we can use the Macro API, which already exists in Hydrator.
...
- There is no programmatic way to set RuntimeArguments for Hydrator, because stage.prepare() is called after a stage is instantiated, and instantiating a stage already requires the stage properties.
Thoughts from Terence:
Below are the thoughts I have so far.
1. Preferences/runtime arguments substitution for configuration values
- Can start with simple $var substitution
- The DataPipeline app performs the substitution
- The preferences can be scoped
- Properties prefixed with the plugin name (stage name?) will have the prefix stripped
- A property in a more specific scope will override the less specific one
- e.g. If the preferences contain both "password" => "a" and "plugin1.password" => "b", then plugin "plugin1" will see "password" => "b"
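The scoping and substitution rules above could be sketched roughly as follows (the function names and the plain-map preference store are illustrative, not the real Hydrator API):

```javascript
// Resolve scoped preferences for one stage: "plugin1.password" beats "password"
// for stage "plugin1", and the stage-name prefix is stripped.
function resolveForStage(prefs, stageName) {
  const resolved = {};
  const prefix = stageName + '.';
  for (const [key, value] of Object.entries(prefs)) {
    if (key.startsWith(prefix)) {
      resolved[key.slice(prefix.length)] = value;   // more specific scope wins
    } else if (!key.includes('.') && !(key in resolved)) {
      resolved[key] = value;                        // global scope as fallback
    }
  }
  return resolved;
}

// Simple $var substitution over a config value; unknown vars are left as-is.
function substitute(template, vars) {
  return template.replace(/\$(\w+)/g, (match, name) =>
    name in vars ? vars[name] : match);
}
```

For example, `resolveForStage({'password': 'a', 'plugin1.password': 'b'}, 'plugin1')` gives plugin1 a `password` of `"b"`, matching the rule above.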
2. Secure store support
- For managing passphrases, so that the plugin config contains only the key name, not the actual key
- Plugins that need sensitive information need to be adjusted to use the key management
- Potentially can have the DataPipeline app do the substitution as well
- But we cannot use "$", since it is already used above. Maybe it can be "#".
- E.g. for plugin config {"password" => "#dbpassword"}, at runtime the actual password named "dbpassword" will be fetched from the secure store.
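A rough sketch of that "#" resolution, with a plain map standing in for the real secure store API (function name is illustrative):

```javascript
// Replace "#name" config values with the secret stored under "name".
// secureStore here is a plain map standing in for the real secure store.
function resolveSecure(config, secureStore) {
  const out = {};
  for (const [key, value] of Object.entries(config)) {
    if (typeof value === 'string' && value.startsWith('#')) {
      const keyName = value.slice(1);
      if (!(keyName in secureStore)) {
        throw new Error('missing secure store key: ' + keyName);
      }
      out[key] = secureStore[keyName];
    } else {
      out[key] = value;
    }
  }
  return out;
}

const resolved = resolveSecure({ password: '#dbpassword' }, { dbpassword: 's3cret' });
```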
3. Expression computation
- Evaluate by the DataPipeline app at runtime when instantiating plugin
- Evaluation result will be used as the plugin config value
- JS expression
- May need to expose some predefined variables (e.g. logicalStartTime)
- Should limit to evaluation of config values
- Per-record expression evaluation would be too slow and easily misused; we shouldn't encourage it.
- For per-record expression computation (e.g. computing an HBase row key), we should encourage use of a JS transform
- For performance reasons
- Inserting an extra JS transform to augment record (e.g. add field, remove field, combine/recompute fields) should be easy.
- May need better support for schema propagation (since this JS transform won't have a fixed in/out schema)
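Limited to config values, the expression evaluation in point 3 could look roughly like this (the helper name is hypothetical, and `logicalStartTime` is the predefined variable mentioned above):

```javascript
// Sketch: evaluate a JS expression once, at plugin instantiation time,
// to produce a plugin config value. Predefined variables (e.g.
// logicalStartTime) are passed in as named parameters.
function evalConfigExpression(expr, vars) {
  const names = Object.keys(vars);
  const values = names.map((n) => vars[n]);
  // new Function restricts evaluation to a single expression per config value
  return new Function(...names, 'return (' + expr + ');')(...values);
}

// e.g. a config value computed relative to the logical start time:
const value = evalConfigExpression('logicalStartTime - 3600000',
                                   { logicalStartTime: 7200000 });
```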
4. Pre/Post custom action hook
- Need to define the Hydrator API for that, so that plugins can be written
- Need to get CDAP-4648 resolved. Specifically I would like to replace WorkflowAction with Worker in Workflow.
- The UI needs to be adjusted
- To support compiled code, users can write a custom plugin for a custom action and use it in Hydrator
- When resolving plugin config values (point 1) for the plugins used in the execution engine (MR or Spark program), the resolution can combine both preferences and the WorkflowToken, where values in the WorkflowToken have higher precedence.
- We can have out of the box custom actions for common actions
- email
- make call to REST (with parameters resolution in point 1 and 3)
- make call to CDAP service (this is slightly different than REST because it involves discovery)
- May need to enable making cross namespace service call
- JS custom action
- Exposes the WorkflowToken for the JS to update
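One possible shape for running such a JS custom action, with the WorkflowToken exposed to the script as a plain object it can read and update (names and the token layout are illustrative, not a committed API):

```javascript
// Sketch: the engine wraps the custom action script in a function and
// hands it the WorkflowToken (here just a plain object) to update.
function runCustomAction(script, token) {
  new Function('token', script)(token);
  return token;
}

const token = { 'records.in': 100 };
runCustomAction("token['validated'] = token['records.in'] > 0;", token);
```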
5. Fork-join and condition
- Underlying Workflow only supports fork-join and condition at the node level
- i.e. fork and run two MR/Spark programs in parallel, or optionally execute an MR/Spark program
- How do we support fork-join and condition at the record level? Andreas, Albert, and I had some discussion on how; the design needs to be finalized.
- Would involve modifying the JSON structure between the UI and the app.
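Since the record-level design is not finalized, the following is only one possible shape: a condition expressed as a routing step that splits a record stream on a predicate (all names are illustrative):

```javascript
// Sketch only: record-level condition as a routing step. Records that
// satisfy the predicate go to one branch, the rest to the other.
function routeRecords(records, predicate) {
  const branches = { pass: [], fail: [] };
  for (const record of records) {
    (predicate(record) ? branches.pass : branches.fail).push(record);
  }
  return branches;
}

const branches = routeRecords([{ amount: 10 }, { amount: -5 }],
                              (r) => r.amount > 0);
```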