Goals

Checklist

  • User stories documented (Albert/Vinisha) 
  • User stories reviewed (Nitin)
  • Design documented (Albert/Vinisha)
  • Design reviewed (Terence/Andreas)
  • Feature merged ()
  • Examples and guides ()
  • Integration tests () 
  • Documentation for feature ()
  • Blog post

Use Cases

  1. A pipeline developer wants to create a pipeline that has several configuration settings that are not known at pipeline creation time, but that are set at the start of each pipeline run. For example, the time partition(s) that should be read by the source, and the name of the dataset sink, need to be set on a per-run basis. The arguments can be set either through CDAP runtime arguments/preferences, or by the pipeline itself. For example, at the start of the run, the pipeline performs some action (e.g. queries a dataset or makes an HTTP call) to look up which time partitions should be read, and where data should be written to, for that pipeline run. Alternatively, a user can manually specify the time partitions through CDAP runtime arguments/preferences and then start the run.

User Stories

  1. As a pipeline developer, I want to be able to configure a plugin property to some value that will get substituted for each run based on the runtime arguments
  2. As a pipeline operator, I want to be able to set arguments for the entire pipeline that will be used for substitution
  3. As a pipeline operator, I want to be able to set arguments for a specific stage in the pipeline that will be used for substitution
  4. As a plugin developer, I want to be able to write code that is executed at the start of the pipeline and sets arguments for the rest of the run.

Design (WIP - don't review yet)

Specifying Macros

We can introduce macro syntax that can be used in plugin configs that the Hydrator app will substitute before any plugin code is run. For example:

{
  "stages": [
    {
      "name": "customers",
      "plugin": {
        "name": "File",
        "type": "batchsource",
        "properties": {
          "path": "hdfs://host:port/${inputpath}" // ${inputpath} will get replaced with the value of the 'customers.inputpath' runtime argument
        }
      }
    },
    {
      "name": "items",
      "plugin": {
        "name": "File",
        "type": "batchsource",
        "properties": {
          "path": "hdfs://host:port/${inputpath}" // ${inputpath} will get replaced with the value of the 'items.inputpath' runtime argument
        }
      }
    }
  ]
}
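
The substitution described above could be sketched roughly as follows. This is only an illustration in Python, not the Hydrator implementation: the function names, the regular expression for macro names, and the sample argument values are all assumptions made for the example. It follows the comments in the config above, where ${inputpath} in stage 'customers' is resolved from the 'customers.inputpath' runtime argument.

```python
import re

# Matches ${name}. The set of characters allowed in a macro name is an
# assumption of this sketch; the design above does not specify it.
MACRO = re.compile(r"\$\{([A-Za-z0-9_.-]+)\}")


def substitute(value, stage_name, runtime_args):
    """Replace each ${name} in value with runtime argument '<stage_name>.<name>'."""
    def lookup(match):
        key = f"{stage_name}.{match.group(1)}"
        if key not in runtime_args:
            raise KeyError(f"no runtime argument named {key!r}")
        return runtime_args[key]
    return MACRO.sub(lookup, value)


def substitute_stage(stage, runtime_args):
    """Return a copy of the stage's plugin properties with macros substituted."""
    props = stage["plugin"]["properties"]
    return {k: substitute(v, stage["name"], runtime_args) for k, v in props.items()}


# Hypothetical runtime arguments for one run.
runtime_args = {
    "customers.inputpath": "data/customers/2016-01-01",
    "items.inputpath": "data/items/2016-01-01",
}
stage = {
    "name": "customers",
    "plugin": {
        "name": "File",
        "type": "batchsource",
        "properties": {"path": "hdfs://host:port/${inputpath}"},
    },
}
resolved = substitute_stage(stage, runtime_args)
```

Raising an error for a missing argument is one possible choice; the design does not yet say whether an unresolved macro should fail the run or be left in place.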

Changes to Existing Plugins (WIP)

Many plugins have fields (configurable properties) that are used in constructing a schema at configure time. These fields need to have macros disabled. The following plugins would be affected:

Plugin | Fields | Notes
------ | ------ | -----
BatchCassandraSource | schema | The schema is parsed for correctness.
RealtimeCassandraSink | addresses | Addresses are parsed at configure time. Parsing a macro would fail.
CopybookSource | copybookContents | Copybook contents are converted to an InputStream and used to get external records, which are in turn used to add fields to the schema.
DedupAggregator | uniqueFields, filterOperation | Both fields are used to validate the input schema created.
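
One way to enforce "macros disabled" for these fields would be a configure-time check that rejects any macro found in them. The Python sketch below is hypothetical: the lookup table simply mirrors the plugins and fields listed above, and the function name and macro pattern are assumptions, not part of any existing plugin API.

```python
import re

# Matches ${...}; the exact macro syntax is still an open design question.
MACRO = re.compile(r"\$\{[^}]+\}")

# Fields listed above that are read at configure time.
MACRO_DISABLED_FIELDS = {
    "BatchCassandraSource": {"schema"},
    "RealtimeCassandraSink": {"addresses"},
    "CopybookSource": {"copybookContents"},
    "DedupAggregator": {"uniqueFields", "filterOperation"},
}


def find_illegal_macros(plugin_name, properties):
    """Return the sorted names of macro-disabled fields that contain a macro."""
    disabled = MACRO_DISABLED_FIELDS.get(plugin_name, set())
    return sorted(f for f in disabled if MACRO.search(properties.get(f, "")))


illegal = find_illegal_macros(
    "DedupAggregator",
    {"uniqueFields": "${fields}", "filterOperation": "age:max"},
)
```

A pipeline could then fail deployment with a clear message when the returned list is non-empty, instead of failing later while the schema is being built.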

 

 


Setting Hydrator runtime arguments using CDAP runtime arguments/preferences

CDAP preferences and runtime arguments will be used directly as Hydrator arguments. 

1.) Runtime arguments can be passed to a Hydrator pipeline in two ways:

  1. Using pre-pipeline custom actions:
    Pre-pipeline custom actions can set runtime arguments. For example, before running the pipeline, a custom action can copy local files to HDFS and set the runtime argument for the input path of a batchsource. To support this, we can expose setPreferences() and getPreferences() as a programmatic API for setting runtime arguments. These arguments can be passed to the Hydrator app using the workflow token.
  2. Using the Hydrator UI:
    For each stage, runtime arguments can be passed from the Hydrator UI using the CDAP REST endpoints of the preferences/runtime arguments framework.

2.) The Hydrator app will perform macro substitution on the properties of each ETLStage. Plugins such as SFTP that need secure substitution through key management can use a 'secure' prefix in the macro. Substitution behavior depends on the prefix of the argument: for a secure key, the macro can be '$secure.key'; for a value that is substituted directly, the macro can be '$inputpath', with no prefix.
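
The prefix-based dispatch in point 2 might look like the sketch below. It is written in Python purely for illustration and assumes that both the runtime arguments and the secure store can be treated as simple key/value lookups; the function name, the identifier pattern, and the sample property values are hypothetical.

```python
import re

# Matches $name or $secure.name. The identifier pattern is an assumption.
MACRO = re.compile(r"\$(secure\.)?([A-Za-z_][A-Za-z0-9_]*)")


def substitute(value, runtime_args, secure_store):
    """Resolve $secure.<key> from the secure store and $<name> from runtime arguments."""
    def lookup(match):
        is_secure, name = match.group(1), match.group(2)
        source = secure_store if is_secure else runtime_args
        return source[name]
    return MACRO.sub(lookup, value)


# Hypothetical SFTP plugin properties and argument sources.
properties = {"host": "$sftphost", "password": "$secure.sftppassword"}
runtime_args = {"sftphost": "sftp.example.com"}
secure_store = {"sftppassword": "s3cret"}

resolved = {
    name: substitute(value, runtime_args, secure_store)
    for name, value in properties.items()
}
```

Keeping the secure lookup separate from plain arguments means the actual secret never has to appear in the pipeline config or in the runtime arguments.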

 

 

Thoughts from Terence:

Below are the thoughts I have so far.
1. Preferences/runtime arguments substitution for configuration values
  - Can start with simple $var substitution
  - The DataPipeline app performs the substitution
  - The preferences can be scoped
    - Properties prefixed with the plugin name (stage name?) will have the prefix stripped
    - A property in a more specific scope will override the less specific one
     - e.g. if the preferences contain both "password" => "a" and "plugin1.password" => "b", then plugin "plugin1" will see "password" => "b"
  - For managing passphrases, the plugin config will contain only the key name, not the actual key
  - Plugins that need sensitive information need to be adjusted to use the key management
  - Potentially can have the DataPipeline app do the substitution as well
    - But we cannot use "$", since it's used above. Maybe can be "#".
      - E.g. for plugin config {"password" => "#dbpassword"}, then at runtime the actual password with name "dbpassword" will be fetched from the secure store.
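
The scoping rule from point 1 can be illustrated with a small Python sketch. It assumes preferences are a flat string map and that the scope is the plugin (or stage) name followed by a dot; the function name is hypothetical.

```python
def scoped_view(preferences, plugin_name):
    """Return the preferences as seen by one plugin.

    Keys prefixed with '<plugin_name>.' have the prefix stripped and
    override an unprefixed key of the same name.
    """
    prefix = plugin_name + "."
    # Start with every entry not scoped to this plugin. Entries scoped to
    # other plugins are left untouched in this sketch.
    view = {k: v for k, v in preferences.items() if not k.startswith(prefix)}
    # More specific (scoped) entries win over less specific ones.
    for key, value in preferences.items():
        if key.startswith(prefix):
            view[key[len(prefix):]] = value
    return view


preferences = {"password": "a", "plugin1.password": "b"}
plugin1_view = scoped_view(preferences, "plugin1")
plugin2_view = scoped_view(preferences, "plugin2")
```

With the preferences from the example above, "plugin1" sees "password" => "b", while any other plugin still sees "password" => "a".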

 

 
