...

  1. To start the preview for an application: 
    Request Method and Endpoint 

    Code Block
    POST /v3/namespaces/{namespace-id}/apps/{app-id}/preview
    where namespace-id is the name of the namespace
          app-id is the name of the application for which the preview is to be started
    
    Note that the user will have to provide the app-id for the preview request.
     
    The response will contain the CDAP-generated unique preview-id, which can be used later to retrieve the preview data per stage.

    The request body will contain the application configuration along with a few additional configs in the preview section, as shown in the example below.

    Code Block
    {
       "artifact":{
          "name":"cdap-data-pipeline",
          "version":"3.5.0-SNAPSHOT",
          "scope":"SYSTEM"
       },
       "name":"MyPipeline",
       "config":{
          "connections":[
             {
                "from":"FTP",
                "to":"CSVParser"
             },
             {
                "from":"CSVParser",
                "to":"Table"
             }
          ],
          "stages":[
             {
                "name":"FTP",
                "plugin":{
                   "name":"FTP",
                   "type":"batchsource",
                   "label":"FTP",
                   "artifact":{
                      "name":"core-plugins",
                      "version":"1.4.0-SNAPSHOT",
                      "scope":"SYSTEM"
                   },
                   "properties":{
                      "referenceName":"myfile",
                      "path":"/tmp/myfile"
                   }
                },
                "outputSchema":"{\"fields\":[{\"name\":\"offset\",\"type\":\"long\"},{\"name\":\"body\",\"type\":\"string\"}]}"
             },
             {
                "name":"MyCSVParser",
                "plugin":{
                   "name":"CSVParser",
                   "type":"transform",
                   "label":"CSVParser",
                   "artifact":{
                      "name":"transform-plugins",
                      "version":"1.4.0-SNAPSHOT",
                      "scope":"SYSTEM"
                   },
                   "properties":{
                      "format":"DEFAULT",
                      "schema":"{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}",
                      "field":"body"
                   }
                },
                "outputSchema":"{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}"
             },
             {
                "name":"MyTable",
                "plugin":{
                   "name":"Table",
                   "type":"batchsink",
                   "label":"Table",
                   "artifact":{
                      "name":"core-plugins",
                      "version":"1.4.0-SNAPSHOT",
                      "scope":"SYSTEM"
                   },
                   "properties":{
                      "schema":"{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}",
                      "name":"mytable",
                      "schema.row.field":"id"
                   }
                },
                "outputSchema":"{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}",
                "inputSchema":[
                   {
                      "name":"id",
                      "type":"int",
                      "nullable":false
                   },
                   {
                      "name":"name",
                      "type":"string",
                      "nullable":false
                   }
                ]
             }
          ],
          "preview": {
             "startStages": ["MyCSVParser"],
             "endStages": ["MyTable"],
             "useSinks": ["MyTable"],
             "outputs": {
                "FTP": {
                   "numRecords": 10,
                   "data": [
                      {"offset": 1, "body": "100,bob"},
                      {"offset": 2, "body": "200,rob"},
                      {"offset": 3, "body": "300,tom"}
                   ]
                }
             }
          }
       }
    }
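
    For illustration only, the preview request could then be submitted with curl; the host, port, namespace, and file name below are placeholders and not part of this specification:

    Code Block
    # Start a preview by POSTing the application configuration (saved locally as app-config.json).
    # localhost:10000 and the "default" namespace are assumed placeholders; adjust for your setup.
    curl -w "\n" -X POST "http://localhost:10000/v3/namespaces/default/apps/MyPipeline/preview" \
         -H "Content-Type: application/json" \
         -d @app-config.json
    # The response is expected to carry the CDAP-generated preview-id used by the status and data requests below.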

    a.  Simple pipeline

    Code Block
    Consider a simple pipeline represented by the following connections.
    
    (FTP)-------->(CSV Parser)-------->(Table)
     
    CASE 1: To preview the entire pipeline:
     
    "preview": {
                   "startStages": ["FTP"],
                   "endStages": ["Table"],
                   "useSinks": ["Table"],
                   "outputs": {
                                  "FTP": {
                                     "numRecords": 10
                                  }
                              }
               }
     
    CASE 2: To preview a section of the pipeline: (CSV Parser)-------->(Table)
     
    "preview": {
                   "startStages": ["CSV Parser"],
                   "endStages": ["Table"],
                   "useSinks": ["Table"],
                   "outputs": {
                                  "FTP": {
                                     "data": [
                                        {"offset": 1, "body": "100,bob"},
                                        {"offset": 2, "body": "200,rob"},
                                        {"offset": 3, "body": "300,tom"}
                                     ]
                                  }
                              }
               }
     
    CASE 3: To preview only a single stage (CSV Parser) in the pipeline:
     
    "preview": {
                   "startStages": ["CSV Parser"],
                   "endStages": ["CSV Parser"],
                   "outputs": {
                                  "FTP": {
                                     "data": [
                                        {"offset": 1, "body": "100,bob"},
                                        {"offset": 2, "body": "200,rob"},
                                        {"offset": 3, "body": "300,tom"}
                                     ]
                                  }
                              }
               }
     
    CASE 4: To verify if records are read correctly from FTP:
     
    "preview": {
                   "startStages": ["FTP"],
                   "endStages": ["FTP"],
                   "outputs": {
                                  "FTP": {
                                     "numOfRecords": 10
                                  }
                              }
               }
     
    CASE 5: To verify the data is getting written to Table properly:
     
    "preview": {
                   "startStages": ["Table"],
                   "endStages": ["Table"],
                   "useSinks": ["Table"],
                   "outputs": {
                                  "CSV Parser": {
                                     "data": [
                                        {"id": 100, "name": "bob"},
                                        {"id": 200, "name": "rob"},
                                        {"id": 300, "name": "tom"}
                                     ]
                                  }
                              }
               }

    b. Fork in the pipeline (multiple sinks)

    Code Block
    Consider the following pipeline:
     
    (S3 Source) --------->(Log Parser)--------->(Group By Aggregator)--------->(Python Evaluator)--------->(Aggregated Result)
                |
                |
                --------->(Javascript Transform)--------->(Raw Data)
     
    CASE 1: To preview the entire pipeline
     
    "preview": {
                   "startStages": ["S3 Source"],
                   "endStages": ["Aggregated Result", "Raw Data"],
                   "useSinks": ["Aggregated Result", "Raw Data"], // useSinks seems redundant, as endStages already controls up to what point the pipeline needs to run
                   "outputs": {
                                  "S3": {
                                     "numOfRecords": 10
                                  }
                              }
               }
     
    CASE 2: To mock the source
     
    "preview": {
                   "startStages": ["Log Parser", "Javascript Transform"],
                   "endStages": ["Aggregated Result", "Raw Data"],
                   "useSinks": ["Aggregated Result", "Raw Data"],
                   "outputs": {
                                  "S3": {
                                     "data": [
                                        "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0800] GET /apache_pb.gif HTTP/1.0 200 2326",
                                        "127.0.0.1 - bob [10/Oct/2000:14:55:36 -0710] GET /apache_pb.gif HTTP/1.0 200 2326",
                                        "127.0.0.1 - tom [10/Oct/2000:23:55:36 -0920] GET /apache_pb.gif HTTP/1.0 200 2326"
                                     ]
                                  }
                              }
               }
     
    CASE 3: To preview the section of the pipeline (Log Parser)--------->(Group By Aggregator)--------->(Python Evaluator)
     
    "preview": {
                   "startStages": ["Log Parser"],
                   "endStages": ["Aggregated Result"],
                   "useSinks": ["Aggregated Result"],
                   "outputs": {
                                  "S3": {
                                     "data": [
                                        "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0800] GET /apache_pb.gif HTTP/1.0 200 2326",
                                        "127.0.0.1 - bob [10/Oct/2000:14:55:36 -0710] GET /apache_pb.gif HTTP/1.0 200 2326",
                                        "127.0.0.1 - tom [10/Oct/2000:23:55:36 -0920] GET /apache_pb.gif HTTP/1.0 200 2326"
                                     ]
                                  }
                              }
               }
     
    CASE 4: To preview the single stage Python Evaluator
     
    "preview": {
                   "startStages": ["Python Evaluator"],
                   "endStages": ["Python Evaluator"],
                   "outputs": {
                                  "Group By Aggregator": {
                                     "data": [
                                        {"ip":"127.0.0.1", "counts":3},
                                        {"ip":"127.0.0.2", "counts":4},
                                        {"ip":"127.0.0.3", "counts":5},
                                        {"ip":"127.0.0.4", "counts":6}
                                     ]
                                  }
                              }
               }

    c. Join in the pipeline (multiple sources)

    Code Block
    Consider the following pipeline:
    (Database)--------->(Python Evaluator)--------->
                                                    |
                                                    |------------>(Join)-------->(Projection)------->(HBase Sink)
                                                    |
       (FTP)--------->(CSV Parser)--------->
     
    CASE 1: To preview the entire pipeline
     
    "preview": {
                   "startStages": ["Database", "FTP"],
                   "endStages": ["HBase Sink"],
                   "useSinks": ["HBase Sink"],
                   "outputs": {
                                  "Database": {
                                     "numOfRecords": 10
                                  },
                                  "FTP": {
                                     "numOfRecords": 20
                                  }
                              }
               }
    
    
    CASE 2: To mock both sources
    "preview": {
                   "startStages": ["Python Evaluator", "CSV Parser"],
    			   "endStages": ["HBase Sink"],
    			   "useSinks": ["HBase Sink"],	
                   "outputs": {
    								"Database": {
    									"data": [
    												{"name":"tom", "counts":3},
    												{"name":"bob", "counts":4},
    												{"name":"rob", "counts":5},
    												{"name":"milo", "counts":6},
    						 					]
    								},
    								"FTP": {
    									"data": [
    												{"offset":1, "body":"tom,100"},
    												{"offset":2, "body":"bob,200"},
    												{"offset":3, "body":"rob,300"},
    												{"offset":4, "body":"milo,400"},
    						 					]
    								}
    						}	
    			}
     
     
    CASE 3: To preview JOIN transform only
    "preview": {
                   "startStages": ["JOIN"],
    			   "endStages": ["JOIN"],
                   "outputs": {
    								"CSV Parser": {
    									"data": [
    												{"name":"tom", "id":3},
    												{"name":"bob", "id":4},
    												{"name":"rob", "id":5},
    												{"name":"milo", "id":6},
    						 					]
    								},
    								"Python Evaluator": {
    									"data": [
    												{"id":1, "last_name":"hardy"},
    												{"id":2, "last_name":"miller"},
    												{"id":3, "last_name":"brosnan"},
    												{"id":4, "last_name":"yellow"},
    						 					]
    								}
    						}	
    			}
    
    

    d. In order to preview either a single transform or a sink, it should have at least one incoming connection.

    Code Block
    Consider a pipeline containing only one transform, which has no connections yet:
     
    (Javascript Transform)
     
    Preview for this would fail since there is no incoming connection provided for the transform.
     
    The pipeline can be modified to add an incoming connection:
     
    (CSV Parser)------------>(Javascript Transform)
     
    Now the preview configuration can be provided as
     
    "preview": {
                   "startStages": ["Javascript Transform"],
    			   "endStages": ["Javascript Transform"],
                   "outputs": {
    								"CSV Parser": {
    									"data": [
    												{"name":"tom", "id":3},
    												{"name":"bob", "id":4},
    												{"name":"rob", "id":5},
    												{"name":"milo", "id":6},
    						 					]
    								}
    						}	
    			}
    
    
    Note that we cannot solve this problem by having a different preview configuration property such as "inputs" for the single stage when no connections are specified. How would this work if the user just drops a JOIN transform on the UI? We would not know in advance how many input connections the JOIN takes.
  2. Once the preview is started, a unique preview-id will be generated for it. The runtime information (preview-id, STATUS) for the preview will be generated and stored (in memory or on disk).

  3. Once the preview execution is complete, its runtime information will be updated with the status of the preview (COMPLETED or FAILED).

  4. To get the status of the preview
    Request Method and Endpoint

    Code Block
    GET /v3/namespaces/{namespace-id}/apps/{app-id}/previews/{preview-id}/status
    where namespace-id is the name of the namespace
          app-id is the name of the application for which preview data is to be requested
    	  preview-id is the id of the preview for which status is to be requested

    The response body will contain the JSON-encoded preview status and an optional message if the preview failed.

    Code Block
    1. If preview is RUNNING
    {
    	"status": "RUNNING" 
    }
    
    2. If preview is COMPLETED
    {
    	"status": "COMPLETED" 
    }
    
    3. If preview FAILED
    {
    	"status": "FAILED"
        "errorMessage": "Preview failure root cause message." 
    }
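
    For illustration only (host, port, namespace, and IDs are placeholders), the status could be polled with curl until it reaches COMPLETED or FAILED:

    Code Block
    # Poll the preview status; substitute the preview-id returned when the preview was started.
    curl -w "\n" "http://localhost:10000/v3/namespaces/default/apps/MyPipeline/previews/<preview-id>/status"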
  5. To get the preview data for a stage:
    Request Method and Endpoint

    Code Block
    GET /v3/namespaces/{namespace-id}/apps/{app-id}/previews/{preview-id}/stages/{stage-name}
    where namespace-id is the name of the namespace
          app-id is the name of the application for which preview data is to be requested
    	  preview-id is the id of the preview for which data is to be requested
          stage-name is the unique name used to identify the stage

    The response body will contain the JSON-encoded input and output data for the stage, as well as the input and output schemas.

    Code Block
    {
    	"inputData": [
                     	{"first_name": "rob", "zipcode": 95131},
                        {"first_name": "bob", "zipcode": 95054},
                        {"first_name": "tom", "zipcode": 94306}
                     ],
    	"outputData":[
    					{"name": "rob", "zipcode": 95131, "age": 21},
                        {"name": "bob", "zipcode": 95054, "age": 22},
                        {"name": "tom", "zipcode": 94306, "age": 23}
    				 ],
    	"inputSchema": {
    						"type":"record",
    						"name":"etlSchemaBody",
    						"fields":[
    									{"name":"first_name", "type":"string"},
    									{"name":"zipcode", "type":"int"}
    								 ]
    					},
    	"outputSchema": {
    						"type":"record",
    						"name":"etlSchemaBody",
    						"fields":[
    									{"name":"name", "type":"string"},
    									{"name":"zipcode", "type":"int"},
    									{"name":"age", "type":"int"}
    								 ]
    					}  
    }
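
    For illustration only (placeholders as above), the recorded data for a single stage, for example the MyCSVParser stage from the earlier configuration, could be fetched with curl:

    Code Block
    # Fetch the input/output data and schemas recorded for the MyCSVParser stage of this preview.
    curl -w "\n" "http://localhost:10000/v3/namespaces/default/apps/MyPipeline/previews/<preview-id>/stages/MyCSVParser"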
  6. To get the logs/metrics for the preview:
    Request Method and Endpoint

    Code Block
    GET /v3/namespaces/{namespace-id}/apps/{app-id}/previews/{preview-id}/logs
    GET /v3/namespaces/{namespace-id}/apps/{app-id}/previews/{preview-id}/metric
    where namespace-id is the name of the namespace
          app-id is the name of the application for which preview data is to be requested
    	  preview-id is the id of the preview for which data is to be requested

    The response would be similar to that of a regular app.
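
    For illustration only (placeholders as above), these could be fetched with curl:

    Code Block
    # Fetch the logs and the metrics collected for the preview run.
    curl -w "\n" "http://localhost:10000/v3/namespaces/default/apps/MyPipeline/previews/<preview-id>/logs"
    curl -w "\n" "http://localhost:10000/v3/namespaces/default/apps/MyPipeline/previews/<preview-id>/metric"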

...