Hydrator Application Configurations

This document describes the configuration changes required in Hydrator for preview mode.

We will have a separate "appPreviewConfig" section in the config JSON which will contain the preview-specific configurations.

Configurations:

The following are the possible configuration parameters for the Hydrator application (a combined sketch of their shape follows the list):

  1. endStages (Optional): List of stages (inclusive) up to which the preview needs to be executed. If not given, the preview will be executed up to the sinks.
  2. realLookups: List of datasets to be used from the real space for lookup purposes only. This is a required field; if it is not given, the datasets from the preview space will be used, which will be empty.
  3. realExternalWrites (Optional): List of EXTERNAL datasets to which the actual writes need to happen. If the user does not want to write to an external dataset, we can replace the sink with a /dev/null sink in the pipeline configurations. Note that this is only required for external datasets. CDAP internal datasets will be created in the preview space only, so even if we write to them, those writes will not be visible in the real space.
  4. numRecords (Optional): Number of records to be read from the source for the preview. If not given, we will read a pre-configured number of records (say 100) from the source.
  5. outputs (Optional): The user can provide test data to the preview using the outputs configuration.
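Before walking through the examples, here is a hedged sketch of how the first four parameters fit together in one appPreviewConfig block (the stage names match Example 1 below; the numRecords value of 50 is only illustrative). The outputs parameter is demonstrated in the examples that follow.

"appPreviewConfig": {
   "endStages": ["DBSink"],          // Optional, defaults to the sinks
   "realLookups": ["File"],          // Required
   "realExternalWrites": ["DBSink"], // Optional
   "numRecords": 50                  // Optional, defaults to a pre-configured value (say 100)
}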

Let's walk through a few example configurations:

Example 1: Consider a simple pipeline: File -> CSVParser -> DBSink

  1. Preview the entire pipeline: read from the File source and write to the database.

    a. With endStages defaulting to the sinks:
    "appPreviewConfig": {
       "realLookups": ["File"],
       "realExternalWrites": ["DBSink"]
    }
     
    b. The same can be achieved by specifying the end stages explicitly:
    "appPreviewConfig": {
       "endStages": ["DBSink"],
       "realLookups": ["FTP"],
       "realExternalWrites": ["DBSink"]
    }
  2. Preview the pipeline with test records (in JSON format) instead of reading them from the File source.

    "appPreviewConfig": {
       "endStages": ["DBSink"],
       "realLookups": ["File"],
       "realExternalWrites": ["DBSink"],
       "outputs": {
          "File": {
             "data": [
                {"offset": 1, "body": "100,bob"},
                {"offset": 2, "body": "200,rob"},
                {"offset": 3, "body": "300,tom"}
             ]
          }
       }
    }
     
     The above pipeline uses the test records provided in the outputs section instead of reading from the actual files.
  3. Preview the pipeline without writing to the database.

    "appPreviewConfig": {
       "endStages": ["DBSink"],
       "realLookups": ["File"],
       "outputs": {
          "File": {
             "data": [
                {"offset": 1, "body": "100,bob"},
                {"offset": 2, "body": "200,rob"},
                {"offset": 3, "body": "300,tom"}
             ]
          }
       }
    }
     
     Since realExternalWrites is not provided, the app will be reconfigured to write to the /dev/null sink.
  4. Preview the single stage CSVParser.

    "appPreviewConfig": {
       "endStages": ["CSVParser"], // Note that endStages is updated to CSVParser to just preview the single stage.
       "realLookups": ["File"],
       "outputs": {
          "File": {
             "data": [
                {"offset": 1, "body": "100,bob"},
                {"offset": 2, "body": "200,rob"},
                {"offset": 3, "body": "300,tom"}
             ]
          }
       }
    }

Example 2: Consider another pipeline, LogsAggregateDatapipeline, with two branches: S3 Source -> Raw Logs and S3 Source -> Log Parser -> Group By Aggregator -> Aggregated Result.

  1. Preview the entire pipeline.

    "appPreviewConfig": {
       "endStages": ["Raw Logs", "Aggregated Result"], // Optional
       "realLookups": ["S3 Source"]
    }
     
     Note that since Raw Logs and Aggregated Result are stored in CDAP datasets, they will be in the PREVIEW space and will not be visible in the REAL space.
  2. Preview the single stage Group By Aggregator.

    "appPreviewConfig": {
       "endStages": ["Group By Aggregator"],
       "realLookups": ["S3 Source"],
       "outputs": {
          "Log Parser": {
             "data" : [
                {
                   "uri":"/ajax/planStatusHistoryNeighbouringSummaries.action?planKey=COOP-DBT&buildNumber=284&_=1423341312519",
                   "ip": "69.181.160.120",
                   "browser": "Mozilla",
                   "device": "AppleWebKit",
                   "httpStatus": 200, 
                   "ts": 123456789
                },
                {
                   "uri":"/rest/api/latest/server?_=1423341312520",
                   "ip": "69.181.160.120",
                   "browser": "Mozilla",
                   "device": "AppleWebKit",
                   "httpStatus": 200, 
                   "ts": 1423346540
                },
                {
                   "uri":"/ajax/planStatusHistoryNeighbouringSummaries.action?planKey=COOP-DBT&buildNumber=284&_=1423341312527",
                   "ip": "69.181.160.120",
                   "browser": "Mozilla",
                   "device": "AppleWebKit",
                   "httpStatus": 200, 
                   "ts": 1423341312
                }
             ] 
          }
       }
    }
  3. Preview the section of the pipeline S3 Source -> Raw Logs.

    "appPreviewConfig": {
       "endStages": ["Raw Logs"],
       "realLookups": ["S3 Source"]
    }
     
     Note that the Log Parser -> Group By Aggregator -> Aggregated Result section of the pipeline will not get executed.

Algorithm to figure out which part of the pipeline would run:

  1. Look at the endStages. The list of endStages defaults to the sinks of the pipeline if not provided.
  2. Traverse the connections from each of the end stages towards the sources.
  3. Stop traversing when a stage with a provided outputs section is reached, or when a source is reached.
  4. The DAG formed from the connections visited while traversing will be executed as a part of the preview.
  5. If multiple disjoint DAGs are created as a part of the traversal, the configuration will be marked as INVALID (a sketch of this traversal follows the list).
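The following is a minimal sketch of this traversal, assuming the pipeline is represented as a map from each stage to the stages that feed it; the class, method, and parameter names are illustrative assumptions and not the actual Hydrator implementation.

import java.util.*;

public class PreviewDagResolver {

  /**
   * Returns the set of stages that would run in the preview, or throws
   * IllegalArgumentException when the traversal produces more than one
   * disjoint section (step 5: the configuration is INVALID).
   */
  static Set<String> resolvePreviewStages(Map<String, Set<String>> incoming,  // stage -> stages that feed it
                                          Set<String> sources,
                                          Set<String> sinks,
                                          Set<String> endStages,              // empty => default to the sinks
                                          Set<String> outputsProvidedFor) {
    // Step 1: endStages default to the sinks of the pipeline.
    Set<String> ends = endStages.isEmpty() ? sinks : endStages;

    // Steps 2-4: walk backwards from every end stage; stop at a stage with
    // provided outputs or at a source. Record the connections we follow.
    Set<String> previewStages = new HashSet<>();
    Map<String, Set<String>> traversed = new HashMap<>();  // undirected view of the followed connections
    Deque<String> stack = new ArrayDeque<>(ends);
    while (!stack.isEmpty()) {
      String stage = stack.pop();
      if (!previewStages.add(stage)) {
        continue;                                          // already expanded
      }
      if (outputsProvidedFor.contains(stage) || sources.contains(stage)) {
        continue;                                          // traversal stops here
      }
      for (String upstream : incoming.getOrDefault(stage, Collections.emptySet())) {
        traversed.computeIfAbsent(stage, k -> new HashSet<>()).add(upstream);
        traversed.computeIfAbsent(upstream, k -> new HashSet<>()).add(stage);
        stack.push(upstream);
      }
    }

    // Step 5: all preview stages must be reachable from one another through
    // the followed connections; otherwise the sections are disjoint.
    Set<String> reached = new HashSet<>();
    Deque<String> toVisit = new ArrayDeque<>();
    previewStages.stream().findFirst().ifPresent(toVisit::push);
    while (!toVisit.isEmpty()) {
      String stage = toVisit.pop();
      if (reached.add(stage)) {
        toVisit.addAll(traversed.getOrDefault(stage, Collections.emptySet()));
      }
    }
    if (!reached.equals(previewStages)) {
      throw new IllegalArgumentException("INVALID preview configuration: disjoint pipeline sections");
    }
    return previewStages;
  }
}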

Consider the following preview configuration for the LogsAggregateDatapipeline:

"appPreviewConfig": {
   "endStages": ["Aggregated Results", "Raw Logs"],
   "realLookups": ["S3 Source"],
   "outputs": {
      "Log Parser": {
         "data" : [
            ...
         ] 
      }
   }
}
  1. As per the algorithm, we start from the endStages and traverse towards the sources. We start with "Aggregated Result".
  2. We traverse back until "Log Parser", since outputs is provided for it. This traversal yields the section of the pipeline: Log Parser -> Group By Aggregator -> Aggregated Result.
  3. The next end stage is "Raw Logs". We start traversing from it and reach "S3 Source". This traversal yields the section of the pipeline: S3 Source -> Raw Logs.
  4. Since the two sections are disjoint, this is an INVALID preview configuration.
