Goals

Checklist

User stories documented (Albert/Vinisha)
User stories reviewed (Nitin)
Design documented (Albert/Vinisha)
Design reviewed (Terence/Andreas)
Feature merged ()
Examples and guides ()
Integration tests ()
Documentation for feature ()
Blog post

Use Cases

A pipeline developer wants to create a pipeline that has several configuration settings that are not known at pipeline creation time, but that are set at the start of each pipeline run.
1. A pipeline developer wants to create a pipeline that reads from a database source and writes to a table sink. He wants to configure the name of the database table and name of the table sink on a per run basis and he gives those values as input before starting the run.
A pipeline developer wants to create a pipeline with a custom action at the start of the run. The custom action based on some logic provides the name of the database to use as source and the name of the table to write in sink. The next stage in pipeline uses this information to read from the appropriate database source and write to the table sink.

User Stories

As a pipeline developer, I want to be able to configure a plugin property to some value that will be substituted each run based on the runtime arguments.
As a pipeline operator, I want to be able to set arguments for the entire pipeline that will be used for substitution.
As a pipeline operator, I want to be able to set arguments for a specific stage in the pipeline that will be used for substitution.
As a plugin developer, I want to be able to write code that is executed at the start of the pipeline and sets arguments for the rest of the run.

Design

Macros Syntax

Expanded Syntax : 
${macro-type(macro)}
 
Shorthand notation:
${macro}
 
Example Usage: 
${runtime-argument(hostname)} - get hostname from runtime arguments
${wf-token(hostname)} - get hostname from workflow token
${secure(access_key)} - get access key from secure store
${function_time(time_format)} - apply the time function to the provided time_format and use the resulting value
 
The default (shorthand) usage reads from runtime arguments. The expanded notation gives the user the option of using other macro types.
Examples :
ipConfig : ${hostname}:${port}
JDBC connection string : jdbc:${jdbc-plugin}://${hostname}:${sql-port}/${db-name}

The "function_time" macro function uses the logical start time of a run to perform the substitution. This is an example of a macro function that is not just a key-value lookup but allows for extra logic to be performed before a value is returned. For now, the implementation will only support the following macro functions: runtime-arguments, workflow tokens, and function_time. Once the secure store API is available, it will also support secure store. In the future, we can see if we will allow developers to create custom macro functions (similar to function_time(...)).
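The syntax above can be sketched as a small substitution routine. This is only an illustration under stated assumptions, not the platform implementation: the class name MacroSketch, the regular expression, and the choice to fail on an unknown macro type or missing value are all hypothetical. Shorthand macros are treated as runtime-argument lookups, as described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CDAP: expands ${key} and ${type(arg)} macros in a property value.
final class MacroSketch {
  // Group 1 is the macro type (or the key in shorthand form); group 2 is the optional argument.
  private static final Pattern MACRO =
      Pattern.compile("\\$\\{([A-Za-z0-9_.:-]+)(?:\\(([^)]*)\\))?\\}");

  static String substitute(String raw, Map<String, Function<String, String>> lookups) {
    Matcher m = MACRO.matcher(raw);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      boolean shorthand = m.group(2) == null;
      // Shorthand ${key} is read from runtime arguments.
      String type = shorthand ? "runtime-argument" : m.group(1);
      String arg = shorthand ? m.group(1) : m.group(2);
      Function<String, String> lookup = lookups.get(type);
      if (lookup == null) {
        throw new IllegalArgumentException("Unsupported macro type: " + type);
      }
      String value = lookup.apply(arg);
      if (value == null) {
        throw new IllegalArgumentException("No value for macro: " + m.group());
      }
      // quoteReplacement keeps '$' and '\' in the value from being read as group references.
      m.appendReplacement(out, Matcher.quoteReplacement(value));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    Map<String, String> runtimeArgs = new HashMap<>();
    runtimeArgs.put("hostname", "localhost");
    runtimeArgs.put("port", "3306");
    Map<String, Function<String, String>> lookups = new HashMap<>();
    lookups.put("runtime-argument", runtimeArgs::get);
    System.out.println(substitute("${hostname}:${runtime-argument(port)}", lookups));
  }
}
```

Other macro types (wf-token, secure, function_time) would be registered in the same lookup map under their own names.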

Pipeline Config

"stages": [
    {
        "name": "Database",
        "plugin": {
            "name": "Database",
            "type": "batchsource",
            "properties": {
				...
                "user": "${username}",
                "password": "${secure(sql-password)}",
                "jdbcPluginName": "jdbc",
                "jdbcPluginType": "${jdbc-type}",
                "connectionString": "jdbc:${jdbc-type}://${hostname}:${port}/${db-name}",
                "importQuery": "select * from ${table-name};"
            }
        }
    },
    {
        "name": "Table",
        "plugin": {
            "name": "Table",
            "type": "batchsink",                                        
            "properties": {
                "schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",
                \"fields\":[{\"name\":\"name\",\"type\":\"string\"},
                {\"name\":\"age\",\"type\":\"int\"},{\"name\":\"emp_id\",\"type\":\"long\"}]}",
                "name": "${table-name}",
                "schema.row.field": "name"
            }
        }
    }
]

Hydrator Plugin Changes

Currently, when we deploy a pipeline, configurePipeline is called on each plugin. We perform a few validations in the configure stage, specifically for schema strings, script syntax, etc. In some plugins we also create a dataset if the dataset doesn't already exist.

The dataset to write to can be macro-substituted, so we have to defer dataset creation to prepareRun rather than doing this in the configure stage.

Deferring dataset creation in prepareRun will require adding a new method to BatchContext.

@Beta
public interface BatchContext extends DatasetContext, TransformContext {

// new method
void createDataset(String datasetName, String typeName, DatasetProperties properties);
 
// existing methods
long getLogicalStartTime();

/**
 * Returns runtime arguments of the Batch Job.
 *
 * @return runtime arguments of the Batch Job.
 */
Map<String, String> getRuntimeArguments();

/**
 * Updates an entry in the runtime arguments.
 *
 * @param key key to update
 * @param value value to update to
 * @param overwrite if {@code true} and if the key exists in the runtime arguments, it will get overwritten to
 *                  the given value; if {@code false}, the existing value of the key won't get updated.
 */
void setRuntimeArgument(String key, String value, boolean overwrite);

/**
 * Returns the hadoop job.
 * @deprecated this method will be removed.
 */
@Deprecated
<T> T getHadoopJob();
...
}

Currently, if a stream given in a stream source or a table given in a table source doesn't exist, we create a new stream/table. We want to keep allowing table creation, because sources need to create external datasets, but we want to disallow stream creation. For this reason, only createDataset is added to BatchContext.
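The deferred dataset creation could look roughly like the following. This is a minimal sketch with hypothetical stand-in types (SketchBatchContext, SketchTableSink, InMemoryBatchContext) so that it compiles on its own. The datasetExists method and the plain Map used for properties are assumptions for illustration; only createDataset is proposed in this design.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the proposed BatchContext addition; not the real CDAP interface.
interface SketchBatchContext {
  // Assumed helper for this sketch only; the design above proposes createDataset alone.
  boolean datasetExists(String datasetName);

  void createDataset(String datasetName, String typeName, Map<String, String> properties);
}

// Hypothetical sink whose dataset name is only known after macro substitution.
final class SketchTableSink {
  private final String resolvedName;

  SketchTableSink(String resolvedName) {
    this.resolvedName = resolvedName;
  }

  // Creation happens here instead of in configurePipeline, because the name may be a macro.
  void prepareRun(SketchBatchContext context) {
    if (!context.datasetExists(resolvedName)) {
      context.createDataset(resolvedName, "table", new HashMap<>());
    }
  }
}

// In-memory context used only to exercise the sketch.
final class InMemoryBatchContext implements SketchBatchContext {
  final List<String> created = new ArrayList<>();

  @Override
  public boolean datasetExists(String datasetName) {
    return created.contains(datasetName);
  }

  @Override
  public void createDataset(String datasetName, String typeName, Map<String, String> properties) {
    created.add(datasetName);
  }
}
```

Calling prepareRun twice with the same resolved name creates the dataset only once, which mirrors the existing "create if it doesn't already exist" behavior.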

PluginConfigurer can be made not to extend DatasetConfigurer later as we no longer want to allow creating a dataset in configure.

However, certain fields are used to determine the schema in the plugin. Because schema validation is essential at configure time, these fields cannot be macro-substituted, and we want to disallow macro usage for them.

Platform Level Substitution:

Plugins can use an "@Macro" annotation to specify whether a plugin field can be a macro, and to provide a configure-value to use at configure time to instantiate the plugin.

When a plugin instance is instantiated at configure time, macros cannot be substituted as the values to substitute have not been specified yet. By default, macros will be disabled for properties. This is to prevent new plugin developers from having to worry about undefined behavior if they did not consider or are not familiar with macros.

public class TableSinkConfig extends PluginConfig {
  @Name(Properties.Table.NAME)
  @Description("Name of the table. If the table does not already exist, one will be created.")
  // The name of the table can be specified by a runtime macro, by default macros are disabled for fields.
  @Macro(enabled=true) 
  private String name;

  @Name(Properties.Table.PROPERTY_SCHEMA)
  @Description("schema of the table as a JSON Object. If the table does not already exist, one will be " +
    "created with this schema, which will allow the table to be explored through Hive. If no schema is given, the " +
    "table created will not be explorable.")
  @Nullable
  private String schemaStr;

  @Name(Properties.Table.PROPERTY_SCHEMA_ROW_FIELD)
  @Description("The name of the record field that should be used as the row key when writing to the table.")
  private String rowField;
}

Macro Annotation

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
public @interface Macro {

  /**
   * Default status if macro is enabled.
   */
  boolean DEFAULT_STATUS = false;

  /**
   * Returns if macro is enabled. Default is 'false'.
   */
  boolean enabled() default DEFAULT_STATUS;

}

/**
 * Contains information about a property used by a plugin.
 */
@Beta
public class PluginPropertyField {

  private final String name;
  private final String description;
  private final String type;
  private final boolean required;
  // returns true if this field can accept macro
  private final boolean macroEnabled;
  ...
}

Notes

This will require a CDAP platform-level change, as it's a new annotation.

PluginInstantiator has to understand whether macros are enabled for a field and set fields appropriately.

During configure time in the configurePipeline method, if a field is macro enabled, the property should not be validated, as the macro has not yet been provided a substitutable value.

During runtime, PluginInstantiator would get the config fields and the values to substitute, and can use that information to substitute macros appropriately and return an instantiated plugin.
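One way the platform could discover macro-enabled fields is by reflection over the plugin config class. The sketch below is an assumption-laden illustration, not the planned implementation: SketchMacro, SketchTableSinkConfig, and MacroFieldScanner are hypothetical names, and the annotation is redeclared here (targeting fields, since TableSinkConfig places it on a field) so that the example is self-contained.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Set;
import java.util.TreeSet;

// Stand-in for the proposed annotation, declared here so the sketch compiles on its own.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface SketchMacro {
  boolean enabled() default false;
}

// Hypothetical config mirroring TableSinkConfig: only "name" accepts macros.
final class SketchTableSinkConfig {
  @SketchMacro(enabled = true)
  private String name;

  private String schemaStr;

  private String rowField;
}

final class MacroFieldScanner {
  // Collects the names of fields that are annotated and explicitly enabled.
  static Set<String> macroEnabledFields(Class<?> configClass) {
    Set<String> names = new TreeSet<>();
    for (Field field : configClass.getDeclaredFields()) {
      SketchMacro macro = field.getAnnotation(SketchMacro.class);
      if (macro != null && macro.enabled()) {
        names.add(field.getName());
      }
    }
    return names;
  }

  // At configure time, macro-enabled properties are skipped because no value exists yet.
  static boolean shouldValidateAtConfigure(Class<?> configClass, String fieldName) {
    return !macroEnabledFields(configClass).contains(fieldName);
  }
}
```

With this shape, fields without the annotation keep their configure-time validation, which matches the "disabled by default" behavior described above.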

Custom Action Setting Config values:

One use case of the feature is to allow custom actions that run before a plugin to set macros. Custom actions can use workflow tokens to set values for field names.

"plugin": {
	"name": "Database",
	"type": "batchsource",
	"properties": {
		"user": "${wf-token(username)}",
		"password": "${secure(sql-password)}",
		"jdbcPluginName": "jdbc",         
		"importQuery": "select * from ${wf-token(table-name)};"
	}
}

If the pipeline builder wants to use a workflow token set by a preceding custom action as the value of a field, he uses the wf-token macro type in that field, as above.

Context has access to the workflow token and we should be able to use workflow tokens similar to runtime arguments for substitution.
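The flow from custom action to plugin property can be sketched as follows. All types here are hypothetical stand-ins (SketchWorkflowToken is a plain key-value holder, not the CDAP workflow token API), and the table names are made-up sample values. The point is only that the action writes a value during the run and the later stage reads it when its property is resolved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a workflow token: key-value state shared across one run.
final class SketchWorkflowToken {
  private final Map<String, String> values = new HashMap<>();

  void put(String key, String value) {
    values.put(key, value);
  }

  String get(String key) {
    return values.get(key);
  }
}

// Hypothetical custom action that picks the table by some logic before the source runs.
final class ChooseTableAction {
  static void run(SketchWorkflowToken token, boolean useArchive) {
    token.put("table-name", useArchive ? "employees_archive" : "employees");
  }
}

// Resolves "select * from ${wf-token(table-name)};" for the source stage.
final class TokenBackedImportQuery {
  static String resolve(SketchWorkflowToken token) {
    String table = token.get("table-name");
    if (table == null) {
      throw new IllegalStateException("No workflow token value for table-name");
    }
    return "select * from " + table + ";";
  }
}
```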

Scoping:

Because macro substitution is performed at the platform level, users who require scoping at the stage level must do it manually. In our example config from a JDBC source to a table sink, both stages share the macro "${table-name}". If the user wants to provide a different value for table-name in the Table sink, he can do so manually:

Syntax	Macro	Evaluates To
${table-name}	table-name	employees
${TableSink:table-name}	TableSink:table-name	employee_sql

This is more of the user creating unique argument keys as opposed to scoping.

Documentation Changes

Regardless of where the substitution occurs, the guidelines for creating Hydrator plugins would have to change. For existing plugins, any validation for properties that are macro-substitutable in configurePipeline must be moved to prepareRun (see reference section for specific plugins). We should also document that macros must be manually scoped and are not unique to individual stages.

Implementation Details

MacroContext

interface MacroContext {
	/**
	 * Given the macro key, return the substituted value.
	 */
	String getValue(String macroKey);
}

Macro Types

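Based on the macro type, one of the MacroContext implementations below will be used to get the value for a macro.
 
// Default (shorthand) macros: looks the key up in the runtime arguments.
class DefaultMacroContext implements MacroContext {
	Map<String, String> runtimeArguments;
	public String getValue(String macroKey) {
		return runtimeArguments.get(macroKey);
	}
}

// secure(...) macros: depends on the secure store API, which is not available yet.
class SecureMacroContext implements MacroContext {
	SecureStore secureStore;
	public String getValue(String macroKey) {
		return secureStore.get(macroKey);
	}
}

// function_time(...) macros: applies a function instead of a plain key-value lookup.
class RuntimeFunctionMacro implements MacroContext {
	long logicalStartTime;
	Function<String, String> timezoneFunction;
	public String getValue(String arguments) {
		return timezoneFunction.apply(arguments);
	}
}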
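Selecting a context by macro type could be done with a simple registry, sketched below. The names SketchMacroContext and MacroTypeRegistry are hypothetical, and the function_time behavior (formatting the logical start time, given in epoch milliseconds, in UTC with a java.time pattern) is an assumption made for illustration, since the design does not fix the time zone or pattern syntax.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// Same shape as MacroContext; redeclared so the sketch is self-contained.
interface SketchMacroContext {
  String getValue(String macroKey);
}

// Hypothetical registry mapping a macro type such as "wf-token" to its context.
final class MacroTypeRegistry {
  private final Map<String, SketchMacroContext> contexts = new HashMap<>();

  void register(String macroType, SketchMacroContext context) {
    contexts.put(macroType, context);
  }

  String resolve(String macroType, String argument) {
    SketchMacroContext context = contexts.get(macroType);
    if (context == null) {
      throw new IllegalArgumentException("Unsupported macro type: " + macroType);
    }
    return context.getValue(argument);
  }

  // Assumption: function_time formats the logical start time (epoch millis) in UTC.
  static SketchMacroContext functionTime(long logicalStartTime) {
    return format -> DateTimeFormatter.ofPattern(format)
        .withZone(ZoneOffset.UTC)
        .format(Instant.ofEpochMilli(logicalStartTime));
  }

  public static void main(String[] args) {
    Map<String, String> runtimeArgs = new HashMap<>();
    runtimeArgs.put("hostname", "localhost");
    MacroTypeRegistry registry = new MacroTypeRegistry();
    registry.register("runtime-argument", runtimeArgs::get);
    registry.register("function_time", functionTime(0L));
    System.out.println(registry.resolve("runtime-argument", "hostname"));
    System.out.println(registry.resolve("function_time", "yyyy-MM-dd"));
  }
}
```

Keeping each macro type behind the same one-method interface means a secure-store context can be registered later without changing the resolution code.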

----------------------

Setting Hydrator runtime arguments using CDAP runtime arguments/preferences

CDAP preferences and runtime arguments will be used directly as Hydrator arguments.

1.) Runtime arguments can be passed to a Hydrator pipeline in two ways:

Using pre-pipeline custom actions:
Pre-pipeline custom actions can set runtime arguments. For example, before running the pipeline, a custom action can copy local files to HDFS and set the runtime argument for the input path of the batchsource. To support this, we can expose setPreferences() and getPreferences() programmatic APIs for setting runtime arguments. These arguments can be passed to the Hydrator app using the workflow token.
Using the Hydrator UI:
For each stage, runtime arguments can be passed from the Hydrator UI using the CDAP REST endpoints of the preferences/runtime arguments framework.

2.) The Hydrator app will substitute properties using macro substitution for each ETLStage. Plugins such as SFTP, which need secure substitution using key management, can use the 'secure' prefix in the macro. Macro substitution varies depending on the prefix of the argument: for a secure key, the macro is '${secure(key)}'; for a value that is substituted directly, the macro is '${inputpath}', without any prefix.

----------------------------

Reference

Many plugins have properties that are used in constructing or validating a schema at configure time. These fields need to have macros disabled to allow this. The following plugins and fields would be affected:

Plugin	Fields	Use	Conflict
BatchCassandraSource	schema	Parsed for correctness to create the schema.	Parsing a macro or schema with a nested macro would fail.
CopybookSource	copybookContents	Copybook contents are converted to an InputStream and used to get external records, which are in turn used to add fields to the schema.	Schema would add macro literal as a field.
DedupAggregator	uniqueFields, filterOperation	Both fields are used to validate the input schema created.	Macro literals do not exist as fields in schema and will throw IllegalArgumentException.
DistinctAggregator	fields	Specifies the fields used to construct the output schema.	Will add macro literals as schema fields.*
GroupByAggregator	groupByFields, aggregates	Gets fields from input schema and adds aggregates to the output fields list.	Macro literals do not exist in the input schema, nor are they valid fields for an output schema.
RowDenormalizerAggregator	keyField, nameField, valueField	Gets schemas by field names from the input schema.	Macro literals do not exist as fields in the input schema.
KVTableSink	keyField, valueField	Validates the presence and type of these fields in the input schema.	Macro literals will not exist in the input schema.
SnapshotFileBatchAvroSink	schema	Parses schema to add file properties.	Macro literals may prevent schema parsing or cause incorrect schema creation.
SnapshotFileBatchParquetSink	schema	Parses schema to add file properties.	Macro literals may prevent schema parsing or cause incorrect schema creation.
TableSink	schema, rowField	Validates output and input schemas if properties specified.	Macro literals will lead to failed validation of schema and row field.
TimePartitionedFileSetDatasetAvroSink	schema	Parses schema to add file properties.	Parsing macro literals in schema would fail.
TimePartitionedFileSetDatasetParquetSink	schema	Parses schema to add file properties.	Parsing macro literals in schema would fail.
SnapshotFileBatchAvroSource	schema	Parses schema property to set output schema.	Macro literals can lead to invalid schema parsing or creation.
SnapshotFileBatchParquetSource	schema	Parses schema property to set output schema.	Macro literals can lead to invalid schema parsing or creation.
StreamBatchSource	schema, name, format	Stream is added and created through name and schema is parsed to set output schema.	Macro literals will lead to bad parsing of properties.
TableSource	schema	Schema parsed to set output schema.	Macro literals will lead to failed or incorrect schema creation.
TimePartitionedFileSetDatasetAvroSource	schema	Schema parsed to set output schema.	Macro literals will lead to failed or incorrect schema creation.
TimePartitionedFileSetDatasetParquetSource	schema	Schema parsed to set output schema.	Macro literals will lead to failed or incorrect schema creation.
JavaScriptTransform	schema, script, lookup	Schema format is used to set the output schema. JavaScript and lookup properties are also parsed for correctness.	Macro literals can cause parsing to fail for schema creation, JavaScript compilation, or lookup parsing.
LogParserTransform	inputName	Gets field from input schema through inputName property.	With a macro literal, the field will not exist in the input schema.
ProjectionTransform	fieldsToKeep, fieldsToDrop, fieldsToConvert, fieldsToRename	Properties are used to create output schema.	Macro literals will lead to a failed or wrong output schema being created.
PythonEvaluator	schema	Schema parsed for correctness and set as output schema.	Macro literal can lead to failed or bad schema creation.
ValidatorTransform	validators, validationScript	Validator property used to set validator plugins. Script property is also parsed for correctness.	Macro literals can lead to failed parsing or plugins being set. Scripts cannot be validated without validators.
ElasticsearchSource	schema	Schema parsed for correctness and set as output schema.	Macro literals can lead to failed or incorrect schema parsing/creation.
HBaseSink	rowField, schema	Parsed to validate the output and input schemas and set the output schema.	Macro literals can lead to failed or incorrect schema parsing/creation.
HBaseSource	schema	Parsed for correctness to set output schema.	Macro literals can lead to failed or incorrect schema parsing/creation.
HiveBatchSource	schema	Parsed for correctness to set output schema.	Macro literals can lead to failed or incorrect schema parsing/creation.
MongoDBBatchSource	schema	Parsed for correctness and validated to set output schema.	Macro literals can lead to failed or incorrect schema parsing/creation.
NaiveBayesClassifier	predictionField	Configures and sets fields of output schema and checked for existence in input schema.	Output schema would be created wrongly with macro literal as prediction field and input schema check behavior is undefined.
Compressor	compressor, schema	Parsed for correctness and used to set output schema.	Macro literals can lead to failed or incorrect schema parsing/creation.
CSVFormatter	schema	Parsed for correctness and used to set output schema.	Macro literals can lead to failed or incorrect schema parsing/creation.
CSVParser	field	Validated against input schema to check existence of field.	Macro literals may not exist as fields in the input schema.
Decoder	decode, schema	Decode property is parsed and validated then used to validate the input schema. Schema parsed to set output schema.	Macro literals can lead to failed or incorrect schema parsing/creation or incorrect validation of input schema.
Decompressor	decompressor, schema	Decompressor property is parsed and validated then used to validate the input schema. Schema parsed to set output schema.	Macro literals can lead to failed or incorrect schema parsing/creation or incorrect validation of input schema.
Encoder	encode, schema	Encode property is parsed and validated then used to validate the input schema. Schema parsed to set output schema.	Macro literals can lead to failed or incorrect schema parsing/creation or incorrect validation of input schema.
JSONFormatter	schema	Parsed for correctness and used to set output schema.	Macro literals can lead to failed or incorrect schema parsing/creation.
JSONParser	field, schema	Validates if field property is present in input schema. Parses schema property to set output schema.	Macro literal may not exist in input schema and may lead to failed parsing or creation of output schema.
StreamFormatter	schema	Parsed for correctness and used to set output schema.	Macro literals can lead to failed or incorrect schema parsing/creation.

* May need verification

Other plugins have fields that are validated/processed at configure time but that do not affect the schema. In these cases, the validation/processing can be moved to the prepareRun method. The following plugins and fields would be affected:

Plugin	Fields	Use	Justification
StreamBatchSource	duration, delay	Parsed and validated for proper formatting.	The parsing/validation is not related to the schema's creation.
TimePartitionedFileSetSource	duration, delay	Parsed and validated for proper formatting.	The parsing/validation is not related to the schema's or dataset's creation.
ReferenceBatchSink	referenceName	Verifies reference name meets dataset ID constraints.	As dataset names can be macros, this supports the primary use case.
ReferenceBatchSource	referenceName	Verifies that reference name meets dataset ID constraints.	As dataset names can be macros, this supports the primary use case.
FileBatchSource	timeTable	Creates dataset from time table property.	This is a primary use case for macros.
TimePartitionedFileSetSource	name, basePath	Name and basePath are used to create the dataset.	This is a primary use case for macros.
BatchWritableSink	name, type	Creates dataset from properties.	This is a primary use case for macros.
SnapshotFileBatchSink	name	Creates dataset from name field.	This is a primary use case for macros.
BatchReadableSource	name, type	Dataset is created from name and type properties.	This is a primary use case for macros.
SnapshotFileBatchSource	all properties*	Creates dataset from properties.	This is a primary use case for macros.
TimePartitionedFileSetSink	all properties*	Creates dataset from properties.	This is a primary use case for macros.
DBSource	importQuery, boundingQuery, splitBy, numSplits	Validated against connection settings and parsed for formatting.	The parsing/validation does not lead to the creation of any schema or dataset.
HDFSSink	timeSuffix	Parsed to validate proper formatting of time suffix.	The parsing/validation does not lead to the creation of any schema or dataset.
KafkaProducer	async	Parsed to check proper formatting of boolean.	The parsing/validation does not lead to the creation of any schema or dataset.
NaiveBayesClassifier	fieldToClassify	Checked if input schema field is of type String.	The validation does not lead to the creation or alteration of any schema.
NaiveBayesTrainer	fieldToClassify, predictionField	Checked if input schema fields are of type String and Double respectively.	The validation does not lead to the creation or alteration of any schema.
CloneRecord	copies	Validated against being 0 or over the max number of copies.	The validation does not lead to the creation of any schema or dataset.
CSVFormatter	format	Validated for proper formatting.	The validation does not lead to the creation of any schema or dataset.
CSVParser	format	Validated for proper formatting.	The validation does not lead to the creation of any schema or dataset.
Hasher	hash	Checked against valid hash formats.	The check does not lead to the validation or alteration of any schema.
JSONParser	mapping	Mappings extracted and placed into a map with their expressions.	The extraction does not affect any schema creation or validation.
StreamFormatter	format	Checked against valid stream formats.	The check does not lead to the validation or alteration of any schema.
ValueMapper	mapping, defaults	Parsed after configuration is initialized and validated.	The check does not lead to the validation or alteration of any schema.