Initial design options.

Design (WIP)

Design-1:

When the user enters Debug/Preview mode after configuring the pipeline, we deploy a debug app and, if the deploy is successful, run the pipeline.

The user can configure the number of input records to be read from a stage in debug mode.

Debug config

 

{
  "name": "pipeline1-debug",
  "debugMode": "pipeline",
  "numRecords": "100",
  ....
  "stages": [
    {
      "name": "Stream",
      "plugin": {
        "name": "Stream",
        "type": "batchsource",
        "artifact": {
          "name": "core-plugins",
          "version": "1.3.0-SNAPSHOT",
          "scope": "SYSTEM"
        },
        "properties": {
          ...
        }
      },
      ...
    },
    ....
  ]
}

 

After deploy, we run the pipeline. In debug mode, each stage's output records and record schema are written to a debug dataset. This makes it possible to show the input/output records for each stage, allowing users to visualize the records and how they are altered at each step.

This deploys a debug-data-pipeline app. The app has a service that can serve the records-in and records-out for a given stage by reading from the debug-table dataset.
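
A rough sketch of what that service could look like, assuming the CDAP HttpServiceHandler API (the handler class, endpoint path, dataset name, and stop-key trick below are illustrative, not decided):

import co.cask.cdap.api.annotation.UseDataSet;
import co.cask.cdap.api.common.Bytes;
import co.cask.cdap.api.dataset.table.Row;
import co.cask.cdap.api.dataset.table.Scanner;
import co.cask.cdap.api.dataset.table.Table;
import co.cask.cdap.api.service.http.AbstractHttpServiceHandler;
import co.cask.cdap.api.service.http.HttpServiceRequest;
import co.cask.cdap.api.service.http.HttpServiceResponder;

import java.util.ArrayList;
import java.util.List;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;

// Hypothetical handler in the debug-data-pipeline app: returns the records a stage
// wrote to the debug table (row key format <stage-name><record-id>).
public class DebugRecordsHandler extends AbstractHttpServiceHandler {

  @UseDataSet("pipeline1-debug")
  private Table debugTable;

  @GET
  @Path("stages/{stage}/records")
  public void getRecords(HttpServiceRequest request, HttpServiceResponder responder,
                         @PathParam("stage") String stageName) {
    List<String> records = new ArrayList<>();
    // Scan all rows whose key starts with the stage name.
    Scanner scanner = debugTable.scan(Bytes.toBytes(stageName),
                                      Bytes.toBytes(stageName + '\uffff'));
    try {
      Row row;
      while ((row = scanner.next()) != null) {
        records.add(row.getString("record-value"));
      }
    } finally {
      scanner.close();
    }
    responder.sendJson(records);
  }
}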

Table:

Debug table design

 

Table-name-format:
<pipeline-name>
 
RowKey-format:
<stage-name><record-id>
 
Columns:
1) record-schema
2) record-value
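
For example, the output of a stage named stage3 would be stored under row keys stage3-record1, stage3-record2, ..., with each row holding that record's schema and its value.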

 

 

Running a single stage:

The debug table makes it possible for the user to run/debug a single pipeline stage, with the input for the stage read from the dataset.

We create a config with a Table source that reads from the debug table and supplies input records to the stage being debugged. This feature is not applicable for sources.

Currently the Table source reads all records; it needs an improvement to read only from a startKey to an endKey.

Single stage debug config

 

{
  "name": "pipeline1-debug-stage-x",
  "debugMode": "stage",
  ....
  "stages": [
    {
      "name": "Table",
      "plugin": {
        "name": "Table",
        "type": "batchsource",
        "artifact": {
          ...
        },
        "properties": {
          "schema": null,
          "name": "pipeline1-debug",
          "schema.row.field": "record",
          "schema.start.key": "stage3-record1", // new
          "schema.end.key": "stage3-record50"   // new
        }
      },
      ...
    },
    {
      "name": "JavaScript",
      "plugin": {
        "name": "JavaScript",
        "type": "transform",
        "label": "JavaScript",
        "artifact": {
          ...
        },
        "properties": {}
      },
      "errorDatasetName": "errors-debug-<ts>"
    },
    {
      "name": "devnullSink",
      "plugin": {
        "name": "devnull",
        "type": "sink",
        "label": "sink",
        "artifact": {
          ...
        },
        "properties": {}
      }
    }
  ]
}

 

NOTE: when debug is enabled, each stage has to write its output records to the dataset after its transform; this could be handled by the platform.
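
One way this could look, as a sketch only (assuming the cdap-etl-api Emitter/StructuredRecord interfaces and a Table debug dataset; the DebugEmitter class and column names are illustrative): the platform wraps each stage's emitter so that every emitted record is also written to the debug table.

import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.dataset.table.Put;
import co.cask.cdap.api.dataset.table.Table;
import co.cask.cdap.etl.api.Emitter;
import co.cask.cdap.etl.api.InvalidEntry;
import co.cask.cdap.format.StructuredRecordStringConverter;

import java.io.IOException;

// Hypothetical wrapper: forwards records to the real emitter and also writes them
// to the debug table using the <stage-name><record-id> row key format.
public class DebugEmitter implements Emitter<StructuredRecord> {
  private final Emitter<StructuredRecord> delegate;
  private final Table debugTable;
  private final String stageName;
  private long recordId = 1;

  public DebugEmitter(Emitter<StructuredRecord> delegate, Table debugTable, String stageName) {
    this.delegate = delegate;
    this.debugTable = debugTable;
    this.stageName = stageName;
  }

  @Override
  public void emit(StructuredRecord record) {
    try {
      Put put = new Put(stageName + "-record" + recordId++);
      put.add("record-schema", record.getSchema().toString());
      put.add("record-value", StructuredRecordStringConverter.toJsonString(record));
      debugTable.put(put);
    } catch (IOException e) {
      throw new RuntimeException("Failed to serialize record for the debug table", e);
    }
    delegate.emit(record);
  }

  @Override
  public void emitError(InvalidEntry<StructuredRecord> invalidEntry) {
    delegate.emitError(invalidEntry);
  }
}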

If the user wants to change a record's data or add a record before running debug on a stage, the service in the debug-data-pipeline app can be used to perform those operations.

Summary:

Pros: 
1) Running the entire pipeline mimics the actual run and gives confidence that the pipeline will work after deployment.
2) The delay in deploy/run can be acceptable since it is part of a debug mode, and the use of additional containers is easier to accept.

Cons:
1) Debugging can take long even for simple changes, since we deploy and run the pipeline each time.

Caveats:
1) Cleanup should be done appropriately: app/dataset deletion, and skipping lineage/metadata during debug mode.
2) If the config has an errorDataset etc., those should be deleted once the debug is done. Multiple debug sessions will likely have created multiple datasets, and they have to be cleaned up appropriately.
3) How to handle metrics/logging and run records; we probably won't delete them.
4) On re-running the entire pipeline for debug, we need to truncate the debug dataset.

Design-2:

Hydrator backend app. We explored this earlier, before implementing plugin endpoints in Release 3.4.

A single backend app (like Tracker) deployed per namespace that can handle requests for debug/preview, instead of one app deployed per pipeline debug.

We might have to progress from one stage to the next, as executing all the stages in one call increases the chance of a timeout. We should have an async response for Spark transforms, etc.

The backend app can cache the plugins and call a plugin's transform directly, avoiding repeated deploy/initialize runs and avoiding an MR job, but we still need to run Spark jobs to execute Spark transforms.
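
A sketch of what calling a cached plugin's transform directly could look like, assuming the cdap-etl-api Transform and Emitter interfaces (the in-memory emitter and the pluginCache lookup are illustrative, not an existing API):

import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.etl.api.Emitter;
import co.cask.cdap.etl.api.InvalidEntry;
import co.cask.cdap.etl.api.Transform;

import java.util.ArrayList;
import java.util.List;

// Collects transform output in memory so it can be returned in the service response.
public class InMemoryEmitter implements Emitter<StructuredRecord> {
  final List<StructuredRecord> emitted = new ArrayList<>();
  final List<InvalidEntry<StructuredRecord>> errors = new ArrayList<>();

  @Override
  public void emit(StructuredRecord value) {
    emitted.add(value);
  }

  @Override
  public void emitError(InvalidEntry<StructuredRecord> invalidEntry) {
    errors.add(invalidEntry);
  }
}

// In the backend app's request handler (plugin lookup, caching, and context setup omitted):
//   Transform<StructuredRecord, StructuredRecord> transform = pluginCache.get(stageName);
//   transform.initialize(transformContext);
//   InMemoryEmitter emitter = new InMemoryEmitter();
//   for (StructuredRecord input : inputRecords) {
//     transform.transform(input, emitter);
//   }
//   // emitter.emitted now holds the stage's output records to return to the UI.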

Pros:
1) Fastest way of running/debugging a single stage.

Cons:

1) Running everything in a single endpoint call could cause timeouts: calling a JS transform could be okay, but running a Spark job on the input records will take longer and might time out.
2) Deploying an app in every namespace and running its services is not ideal and increases overhead.