
  1. To start the preview for an application: 
    Request Method and Endpoint 

    Code Block
    POST /v3/namespaces/{namespace-id}/apps/{app-id}/preview
    where namespace-id is the name of the namespace
          app-id is the name of the application for which the preview is to be started
     
    The response will contain the CDAP-generated unique preview-id, which can be used later to get the preview data per stage.

    The request body will contain the application configuration along with the following additional configurations for the preview section.

    Code Block
    "preview": {
                   "numRecords" : "10",
                   "startStage" : "stage_1",
                   "endStage"   : "stage_3",
                   "inputData"  : [
                                    {"name": "rob", "address": "san jose"},
                                    {"name": "bob", "address": "santa clara"},
                                    {"name": "tom", "address": "palo alto"}
                                  ],
                   "programType": "WORKFLOW",        // programType and programName can be optional for now. However, if in the future we want to preview a non-hydrator application, programType and programName can be provided to let the preview system know which program is to be previewed
                   "programName": "DataPipelineWorkflow"
               }

    Description:
    1. numRecords: Number of records to preview.
    2. startStage: Pipeline stage from which the preview needs to be started.
    3. endStage: Pipeline stage till which the preview needs to be run.
    4. inputData: Data which needs to be run through the preview process.

    Validations to be performed:
    1. startStage and endStage are connected together.
    2. The schema of the inputData should match the input schema of the startStage (stage_1 in the example above).
    3. If a SOURCE plugin is specified as the startStage, the preview will ignore the inputData and read the records directly from the source, as specified by numRecords.
    4. If the startStage is anything other than a Source plugin, then inputData is required. The preview will process the inputData, ignoring numRecords.
    Consider a pipeline which has an FTP source, a CSV parser labeled MyCSVParser, and a Table sink labeled MyTable. The configuration with the preview data will look like the following:

    Code Block
    {
        "artifact":{
            "name":"cdap-data-pipeline",
            "version":"3.5.0-SNAPSHOT",
            "scope":"SYSTEM"
        },
        "name":"MyPipeline",
        "config":{
            "connections":[
                {
                    "from":"FTP",
                    "to":"CSVParser"
                },
                {
                    "from":"CSVParser",
                    "to":"Table"
                }
            ],
            "stages":[
                {
                    "name":"FTP",
                    "plugin":{
                        "name":"FTP",
                        "type":"batchsource",
                        "label":"FTP",
                        "artifact":{
                            "name":"core-plugins",
                            "version":"1.4.0-SNAPSHOT",
                            "scope":"SYSTEM"
                        },
                        "properties":{
                            "referenceName":"myfile",
                            "path":"/tmp/myfile"
                        }
                    },
                    "outputSchema":"{\"fields\":[{\"name\":\"offset\",\"type\":\"long\"},{\"name\":\"body\",\"type\":\"string\"}]}"
                },
                {
                    "name":"MyCSVParser",
                    "plugin":{
                        "name":"CSVParser",
                        "type":"transform",
                        "label":"CSVParser",
                        "artifact":{
                            "name":"transform-plugins",
                            "version":"1.4.0-SNAPSHOT",
                            "scope":"SYSTEM"
                        },
                        "properties":{
                            "format":"DEFAULT",
                            "schema":"{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}",
                            "field":"body"
                        }
                    },
                    "outputSchema":"{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}"
                },
                {
                    "name":"MyTable",
                    "plugin":{
                        "name":"Table",
                        "type":"batchsink",
                        "label":"Table",
                        "artifact":{
                            "name":"core-plugins",
                            "version":"1.4.0-SNAPSHOT",
                            "scope":"SYSTEM"
                        },
                        "properties":{
                            "schema":"{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}",
                            "name":"mytable",
                            "schema.row.field":"id"
                        }
                    },
                    "outputSchema":"{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}",
                    "inputSchema":[
                        {
                            "name":"id",
                            "type":"int",
                            "nullable":false
                        },
                        {
                            "name":"name",
                            "type":"string",
                            "nullable":false
                        }
                    ]
                }
            ],
            "preview": {
                "startStages": ["MyCSVParser"],
                "endStages": ["MyTable"],
                "useSinks": ["MyTable"],
                "outputs": {
                    "FTP": {
                        "data": [
                            {"offset": 1, "body": "100,bob"},
                            {"offset": 2, "body": "200,rob"},
                            {"offset": 3, "body": "300,tom"}
                        ],
                        "schema": {
                            "type": "record",
                            "fields": [
                                {"name":"offset","type":"long"},
                                {"name":"body","type":"string"}
                            ]
                        }
                    }
                }
            }
        }
    }
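    For illustration, the sketch below shows how a client might submit the request body above using Python's requests library. The router address, namespace, and file name (localhost:11015, default, mypipeline-with-preview.json) are assumptions for the example, not part of the API described here.

    Code Block
    import json
    import requests

    # Assumed values for the example; adjust for a real deployment.
    CDAP = "http://localhost:11015"
    NAMESPACE = "default"
    APP = "MyPipeline"

    # app_config is the JSON shown in the Code Block above, loaded from a file here.
    with open("mypipeline-with-preview.json") as f:
        app_config = json.load(f)

    # Start the preview; the response is expected to carry the CDAP-generated preview-id.
    resp = requests.post(
        f"{CDAP}/v3/namespaces/{NAMESPACE}/apps/{APP}/preview",
        headers={"Content-Type": "application/json"},
        data=json.dumps(app_config),
    )
    resp.raise_for_status()
    print("preview started:", resp.text)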

    a.  Simple pipeline

    Code Block
    Consider a simple pipeline represented by the following connections.

    (FTP)-------->(CSV Parser)-------->(Table)

    CASE 1: To preview the entire pipeline:

    "preview": {
                   "startStages": ["FTP"],
                   "endStages": ["Table"],
                   "useSinks": ["Table"],
                   "numRecords": 10,
                   "programName": "SmartWorkflow",  // The program to be previewed
                   "programType": "workflow"        // The program type to be previewed
               }

    CASE 2: To preview a section of the pipeline: (CSV Parser)-------->(Table)

    "preview": {
                   "startStages": ["CSVParser"],
                   "endStages": ["Table"],
                   "useSinks": ["Table"],
                   "outputs": {
                       "FTP": {
                           "data": [
                               {"offset": 1, "body": "100,bob"},
                               {"offset": 2, "body": "200,rob"},
                               {"offset": 3, "body": "300,tom"}
                           ],
                           "schema": {
                               "type": "record",
                               "fields": [
                                   {"name":"offset","type":"long"},
                                   {"name":"body","type":"string"}
                               ]
                           }
                       }
                   }
               }

    CASE 3: To preview only a single stage (CSV Parser) in the pipeline:

    "preview": {
                   "startStages": ["CSV Parser"],
                   "endStages": ["CSV Parser"],
                   "outputs": {
                       "FTP": {
                           "data": [
                               {"offset": 1, "body": "100,bob"},
                               {"offset": 2, "body": "200,rob"},
                               {"offset": 3, "body": "300,tom"}
                           ],
                           "schema": {
                               "type": "record",
                               "fields": [
                                   {"name":"offset","type":"long"},
                                   {"name":"body","type":"string"}
                               ]
                           }
                       }
                   }
               }

    CASE 4: To verify if records are read correctly from FTP:

    "preview": {
                   "startStages": ["FTP"],
                   "endStages": ["FTP"],
                   "numOfRecords": 10
               }

    CASE 5: To verify the data is getting written to the Table properly:

    "preview": {
                   "startStages": ["Table"],
                   "endStages": ["Table"],
                   "useSinks": ["Table"],
                   "outputs": {
                       "CSV Parser": {
                           "data": [
                               {"id": 1, "name": "bob"},
                               {"id": 2, "name": "rob"},
                               {"id": 3, "name": "tom"}
                           ],
                           "schema": {
                               "type": "record",
                               "fields": [
                                   {"name":"id","type":"long"},
                                   {"name":"name","type":"string"}
                               ]
                           }
                       }
                   }
               }
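    The cases above assume that the stages named in the preview section exist in the pipeline and that every stage feeding a mocked start stage has data in outputs. The sketch below shows roughly what such client-side checks could look like; the function and its rules are illustrative assumptions, not the CDAP implementation.

    Code Block
    from typing import Dict, List

    def validate_preview_config(stages: List[str],
                                connections: List[Dict[str, str]],
                                preview: Dict) -> List[str]:
        """Return human-readable problems found in a preview section (illustrative only)."""
        problems = []
        stage_set = set(stages)

        # Start and end stages must refer to stages that exist in the pipeline.
        for key in ("startStages", "endStages"):
            for stage in preview.get(key, []):
                if stage not in stage_set:
                    problems.append(f"{key} entry '{stage}' is not a stage in the pipeline")

        # A connection into a start stage is not executed during preview,
        # so its source stage needs mock data in the "outputs" section.
        outputs = preview.get("outputs", {})
        for conn in connections:
            if conn["to"] in preview.get("startStages", []) and conn["from"] not in outputs:
                problems.append(f"stage '{conn['from']}' feeds start stage '{conn['to']}' "
                                f"but has no entry in preview outputs")
        return problems

    # Example: previewing only the CSV Parser stage of the simple pipeline above.
    stages = ["FTP", "CSV Parser", "Table"]
    connections = [{"from": "FTP", "to": "CSV Parser"}, {"from": "CSV Parser", "to": "Table"}]
    preview = {"startStages": ["CSV Parser"], "endStages": ["CSV Parser"],
               "outputs": {"FTP": {"data": []}}}
    print(validate_preview_config(stages, connections, preview))  # -> []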

    b. Fork in the pipeline (multiple sinks)

    Code Block
    Consider the following pipeline:

    (S3 Source)--------->(Log Parser)--------->(Group By Aggregator)--------->(Python Evaluator)--------->(Aggregated Result)
               |
               --------->(Javascript Transform)--------->(Raw Data)

    CASE 1: To preview the entire pipeline

    "preview": {
                   "startStages": ["S3 Source"],
                   "endStages": ["Aggregated Result", "Raw Data"],
                   "useSinks": ["Aggregated Result", "Raw Data"],  // useSinks seems redundant, as endStages is there and can control up to what point the pipeline needs to run
                   "numOfRecords": 10
               }

    CASE 2: To mock the source

    "preview": {
                   "startStages": ["Log Parser", "Javascript Transform"],
                   "endStages": ["Aggregated Result", "Raw Data"],
                   "useSinks": ["Aggregated Result", "Raw Data"],
                   "outputs": {
                       "S3 Source": {
                           "data": [
                               "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0800] GET /apache_pb.gif HTTP/1.0 200 2326",
                               "127.0.0.1 - bob [10/Oct/2000:14:55:36 -0710] GET /apache_pb.gif HTTP/1.0 200 2326",
                               "127.0.0.1 - tom [10/Oct/2000:23:55:36 -0920] GET /apache_pb.gif HTTP/1.0 200 2326"
                           ],
                           "schema": {
                               "type": "record",
                               "fields": [
                                   {"name":"log_line","type":"string"}
                               ]
                           }
                       }
                   }
               }

    CASE 3: To preview the section of the pipeline (Log Parser)--------->(Group By Aggregator)--------->(Python Evaluator)

    "preview": {
                   "startStages": ["Log Parser"],
                   "endStages": ["Aggregated Result"],
                   "useSinks": ["Aggregated Result"],
                   "outputs": {
                       "S3 Source": {
                           "data": [
                               "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0800] GET /apache_pb.gif HTTP/1.0 200 2326",
                               "127.0.0.1 - bob [10/Oct/2000:14:55:36 -0710] GET /apache_pb.gif HTTP/1.0 200 2326",
                               "127.0.0.1 - tom [10/Oct/2000:23:55:36 -0920] GET /apache_pb.gif HTTP/1.0 200 2326"
                           ],
                           "schema": {
                               "type": "record",
                               "fields": [
                                   {"name":"log_line","type":"string"}
                               ]
                           }
                       }
                   }
               }

    CASE 4: To preview the single stage Python Evaluator

    "preview": {
                   "startStages": ["Python Evaluator"],
                   "endStages": ["Python Evaluator"],
                   "outputs": {
                       "Group By Aggregator": {
                           "data": [
                               {"ip":"127.0.0.1", "counts":3},
                               {"ip":"127.0.0.2", "counts":4},
                               {"ip":"127.0.0.3", "counts":5},
                               {"ip":"127.0.0.4", "counts":6}
                           ],
                           "schema": {
                               "type": "record",
                               "fields": [
                                   {"name":"ip","type":"string"},
                                   {"name":"counts","type":"long"}
                               ]
                           }
                       }
                   }
               }
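    When the preview starts at stages other than the source (as in CASE 2 above), every stage that feeds a start stage has to be mocked in outputs. The sketch below derives that set from the connection list; the helper name and the hard-coded fork pipeline are assumptions for illustration.

    Code Block
    from typing import Dict, List, Set

    def stages_to_mock(connections: List[Dict[str, str]], start_stages: List[str]) -> Set[str]:
        """Return the stages whose output must be provided in the preview "outputs" section."""
        starts = set(start_stages)
        return {c["from"] for c in connections if c["to"] in starts}

    # The fork pipeline from this section.
    connections = [
        {"from": "S3 Source", "to": "Log Parser"},
        {"from": "S3 Source", "to": "Javascript Transform"},
        {"from": "Log Parser", "to": "Group By Aggregator"},
        {"from": "Group By Aggregator", "to": "Python Evaluator"},
        {"from": "Python Evaluator", "to": "Aggregated Result"},
        {"from": "Javascript Transform", "to": "Raw Data"},
    ]

    # CASE 2 (mocking the source): only "S3 Source" needs an entry in "outputs".
    print(stages_to_mock(connections, ["Log Parser", "Javascript Transform"]))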
    
    

    c. Join in the pipeline (multiple sources)

    Code Block
    Consider the following pipeline:
    (Database)--------->(Python Evaluator)--------->
               										| 
    												|------------>(Join)-------->(Projection)------->(HBase Sink)
    												|	
               (FTP)--------->(CSV Parser)--------->
     
     
    CASE 1: To preview entire pipeline

    "preview": {
                   "startStages": ["Database", "FTP"],
                   "endStages": ["HBase Sink"],
                   "useSinks": ["HBase Sink"],
                   "numOfRecords": 10
               }

    CASE 2: To mock both sources

    "preview": {
                   "startStages": ["Python Evaluator", "CSV Parser"],
                   "endStages": ["HBase Sink"],
                   "useSinks": ["HBase Sink"],
                   "outputs": {
                       "Database": {
                           "data": [
                               {"name":"tom", "counts":3},
                               {"name":"bob", "counts":4},
                               {"name":"rob", "counts":5},
                               {"name":"milo", "counts":6}
                           ],
                           "schema": {
                               "type": "record",
                               "fields": [
                                   {"name":"name","type":"string"},
                                   {"name":"counts","type":"long"}
                               ]
                           }
                       },
                       "FTP": {
                           "data": [
                               {"offset":1, "body":"tom,100"},
                               {"offset":2, "body":"bob,200"},
                               {"offset":3, "body":"rob,300"},
                               {"offset":4, "body":"milo,400"}
                           ],
                           "schema": {
                               "fields": [
                                   {"name":"name","type":"string"},
                                   {"name":"offset","type":"long"}
                               ]
                           }
                       }
                   }
               }

    CASE 3: To preview JOIN transform only

    "preview": {
                   "startStages": ["JOIN"],
                   "endStages": ["JOIN"],
                   "outputs": {
                       "Python Evaluator": {
                           "data": [
                               {"name":"tom", "counts":3},
                               {"name":"bob", "counts":4},
                               {"name":"rob", "counts":5},
                               {"name":"milo", "counts":6}
                           ],
                           "schema": {
                               "type": "record",
                               "fields": [
                                   {"name":"name","type":"string"},
                                   {"name":"counts","type":"long"}
                               ]
                           }
                       },
                       "CSV Parser": {
                           "data": [
                               {"offset":1, "body":"tom,100"},
                               {"offset":2, "body":"bob,200"},
                               {"offset":3, "body":"rob,300"},
                               {"offset":4, "body":"milo,400"}
                           ],
                           "schema": {
                               "fields": [
                                   {"name":"name","type":"string"},
                                   {"name":"offset","type":"long"}
                               ]
                           }
                       }
                   }
               }
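    Each mocked entry in outputs pairs raw JSON data with a schema, and conceptually the preview system has to turn those records into the structured records the downstream stage expects. The sketch below shows one way that conversion could look for the Database mock in the cases above; the coercion rules and helper name are illustrative assumptions.

    Code Block
    from typing import Any, Dict, List

    # Minimal type coercion for the schema types used in this document.
    _CASTS = {"string": str, "long": int, "int": int}

    def to_structured_records(mock: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Coerce each mocked JSON record to the field types declared in its schema."""
        fields = mock["schema"]["fields"]
        return [{f["name"]: _CASTS[f["type"]](row[f["name"]]) for f in fields}
                for row in mock["data"]]

    database_mock = {
        "data": [{"name": "tom", "counts": 3}, {"name": "bob", "counts": 4}],
        "schema": {"type": "record",
                   "fields": [{"name": "name", "type": "string"},
                              {"name": "counts", "type": "long"}]},
    }
    print(to_structured_records(database_mock))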

    d. Preview for a single stage (TBD)

    Code Block
    Consider the pipeline containing only one transform which has no connections yet -

    (Javascript Transform)

    The preview configurations can be provided as

    "preview": {
                   "startStages": ["Javascript Transform"],
                   "endStages": ["Javascript Transform"],
                   "outputs": {
                       "MockSource": {
                           "data": [
                               {"name":"tom", "id":3},
                               {"name":"bob", "id":4},
                               {"name":"rob", "id":5},
                               {"name":"milo", "id":6}
                           ],
                           "schema": {
                               "type": "record",
                               "fields": [
                                   {"name":"name","type":"string"},
                                   {"name":"id","type":"long"}
                               ]
                           }
                       }
                   }
               }

    a. Connection configuration in the application JSON can be updated to specify the input mock data for the destination stage of the connection.

    Code Block
    Consider the following sample DAG.

    "connections":[
             {
                "from":"FTP",
                "to":"CSVParser"
             },
             {
                "from":"CSVParser",
                "to":"Table"
             }
          ]

    Now if the user wants to provide the input data to the CSVParser instead of reading it from the FTP source, the inputData can be provided as:

    "connections":[
             {
                "from":"FTP",
                "to":"CSVParser",
                "inputData": [
                                 {"offset": 1, "body": "100,bob"},
                                 {"offset": 2, "body": "200,rob"},
                                 {"offset": 3, "body": "300,tom"}
                             ]
             },
             {
                "from":"CSVParser",
                "to":"Table"
             }
          ]

    The above configuration indicates that the connection from FTP to CSVParser needs to be updated to read the inputData instead of reading from the FTP source directly.

    b. In case there are multiple sources in the pipeline, or multiple stages reading from the same source, and the user wants to provide the mock data for all of them, every connection emitted from the source needs to have the inputData associated with it.

    Code Block
    For example, consider the following pipeline DAG:

    "connections": [
            {
                "from": "S3 Source",
                "to": "Log Parser"
            },
            {
                "from": "Log Parser",
                "to": "Group By Aggregator"
            },
            {
                "from": "Group By Aggregator",
                "to": "Aggregated Result"
            },
            {
                "from": "S3 Source",
                "to": "Raw Logs"
            }
        ]

    Now if the user wants to preview the pipeline but does not want to read the data from the S3 Source, the connections can be updated with the inputData information as

    "connections": [
            {
                "from": "S3 Source",
                "to": "Log Parser",
                "inputData": [
                                "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0800] GET /apache_pb.gif HTTP/1.0 200 2326",
                                "127.0.0.1 - bob [10/Oct/2000:14:55:36 -0710] GET /apache_pb.gif HTTP/1.0 200 2326",
                                "127.0.0.1 - tom [10/Oct/2000:23:55:36 -0920] GET /apache_pb.gif HTTP/1.0 200 2326"
                             ]
            },
            {
                "from": "Log Parser",
                "to": "Group By Aggregator"
            },
            {
                "from": "Group By Aggregator",
                "to": "Aggregated Result"
            },
            {
                "from": "S3 Source",
                "to": "Raw Logs",
                "inputData": [
                                "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0800] GET /apache_pb.gif HTTP/1.0 200 2326",
                                "127.0.0.1 - bob [10/Oct/2000:14:55:36 -0710] GET /apache_pb.gif HTTP/1.0 200 2326",
                                "127.0.0.1 - tom [10/Oct/2000:23:55:36 -0920] GET /apache_pb.gif HTTP/1.0 200 2326"
                             ]
            }
        ]

    c. If the user wants to use the actual source for the preview, the number of records read from the source can be limited by the numOfRecords property.

    Code Block
    "connections": [
            {
                "from": "S3 Source",
                "to": "Log Parser",
                "numOfRecords": 10
            },
            {
                "from": "Log Parser",
                "to": "Group By Aggregator"
            },
            {
                "from": "Group By Aggregator",
                "to": "Aggregated Result"
            },
            {
                "from": "S3 Source",
                "to": "Raw Logs",
                "numOfRecords": 10,
                "inputData": [
                                "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0800] GET /apache_pb.gif HTTP/1.0 200 2326",
                                "127.0.0.1 - bob [10/Oct/2000:14:55:36 -0710] GET /apache_pb.gif HTTP/1.0 200 2326",
                                "127.0.0.1 - tom [10/Oct/2000:23:55:36 -0920] GET /apache_pb.gif HTTP/1.0 200 2326"
                             ]
            }
        ]

    In the above example configuration, 10 records will be read from the S3 Source in order to pass them to the Log Parser; however, since the inputData is specified for the connection from S3 Source to Raw Logs, the inputData will be passed to Raw Logs without reading from the S3 Source.

    d. If the user does not want to write to the sink, the following configuration options are available:

    Code Block
    Option a) Specify the list of sinks to ignore writing to, using the ignoreSinks property.

    "preview" : {
                    "ignoreSinks" : ["Raw Logs", "Aggregated Results"]
                }

    Option b) For each connection to the sink, add the ignoreConnection property and set it to true, as

    "connections": [
            {
                "from": "S3 Source",
                "to": "Log Parser",
                "numOfRecords": 10
            },
            {
                "from": "Log Parser",
                "to": "Group By Aggregator"
            },
            {
                "from": "Group By Aggregator",
                "to": "Aggregated Result",
                "ignoreConnection": "true"
            },
            {
                "from": "S3 Source",
                "to": "Raw Logs",
                "numOfRecords": 10
            }
        ]

    In the example configuration above, the preview will write to Raw Logs; however, writing to the sink Aggregated Result is ignored.
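    The connection-level options above (inputData, numOfRecords, ignoreConnection, ignoreSinks) amount to a per-connection decision about where the preview data comes from and whether it is written out. The sketch below expresses that decision in code; the function, its precedence order, and the return strings are illustrative assumptions.

    Code Block
    from typing import Any, Dict

    def resolve_connection(conn: Dict[str, Any], preview: Dict[str, Any]) -> str:
        """Describe how a single connection would behave during preview (illustrative only)."""
        if conn.get("ignoreConnection") == "true" or conn["to"] in preview.get("ignoreSinks", []):
            return f"{conn['from']} -> {conn['to']}: skipped (sink ignored)"
        if "inputData" in conn:
            return f"{conn['from']} -> {conn['to']}: use {len(conn['inputData'])} mocked records"
        if "numOfRecords" in conn:
            return f"{conn['from']} -> {conn['to']}: read {conn['numOfRecords']} records from the source"
        return f"{conn['from']} -> {conn['to']}: run normally"

    preview = {"ignoreSinks": ["Raw Logs", "Aggregated Results"]}
    for conn in [
        {"from": "S3 Source", "to": "Log Parser", "numOfRecords": 10},
        {"from": "Log Parser", "to": "Group By Aggregator"},
        {"from": "Group By Aggregator", "to": "Aggregated Result", "ignoreConnection": "true"},
    ]:
        print(resolve_connection(conn, preview))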
  2.  

  3. Once the preview is started, a unique preview id will be generated for it. The preview id could be of the form namespace_id.app_id.preview. The runtime information (preview_id, STATUS) for the preview will be generated and stored (in memory or on disk). While a preview is running, if the user sends another preview request for the same application and the STATUS is RUNNING, the user will get a 403 status code with the error message "Preview already running for application." This enforces that only one preview is running at any instant of time for a given application in a given namespace.
    If the startStage in the preview configuration is not a SOURCE plugin, the preview system will generate a MOCK source in the pipeline which will read the JSON records specified in the inputData field and convert them into StructuredRecords.
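    Because only one preview can run at a time for a given application, a client that gets the 403 "Preview already running for application." response has to wait and retry. A minimal sketch of that client behaviour follows; the host, namespace, and retry policy are assumptions.

    Code Block
    import time
    import requests

    CDAP = "http://localhost:11015"   # assumed router address
    URL = f"{CDAP}/v3/namespaces/default/apps/MyPipeline/preview"

    def start_preview_with_retry(body: str, attempts: int = 5, wait_seconds: int = 10) -> str:
        """Retry while a previous preview of the same application is still running."""
        for _ in range(attempts):
            resp = requests.post(URL, data=body, headers={"Content-Type": "application/json"})
            if resp.status_code != 403:       # 403 => preview already running for this application
                resp.raise_for_status()
                return resp.text              # expected to carry the preview id
            time.sleep(wait_seconds)
        raise RuntimeError("a preview for this application is still running")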
    
  4. How to specify the input data: The user can specify the input data for the preview by inserting the data directly in table format in the UI, or by uploading a file containing the records.
    When the data is inserted in table format, the UI will convert the data into the appropriate JSON records.
    When the user decides to upload a file, they can upload a JSON file conforming to the schema of the next stage. Ideally we should allow uploading a CSV file as well; however, how to interpret the data will be plugin dependent. For example, consider a list of CSV records: for the CSVParser plugin the entire record will be treated as the body, whereas for the Table sink we will have to split the record on commas to create the multiple fields specified by the next stage's input schema. A sketch of this interpretation is shown below.
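    The following sketch illustrates that plugin-dependent interpretation; the helper name and its treat_as_body flag are assumptions for the example, not UI code.

    Code Block
    import csv
    import io
    from typing import Dict, List

    def csv_to_records(csv_text: str, next_stage_fields: List[str], treat_as_body: bool) -> List[Dict]:
        """Turn uploaded CSV text into JSON-like records for the next stage."""
        records = []
        for offset, line in enumerate(csv_text.strip().splitlines(), start=1):
            if treat_as_body:
                # e.g. a CSVParser-style stage expects {"offset": ..., "body": "<whole line>"}
                records.append({"offset": offset, "body": line})
            else:
                # e.g. a Table-style sink expects one field per comma-separated value
                values = next(csv.reader(io.StringIO(line)))
                records.append(dict(zip(next_stage_fields, values)))
        return records

    sample = "100,bob\n200,rob\n300,tom"
    print(csv_to_records(sample, ["offset", "body"], treat_as_body=True))
    print(csv_to_records(sample, ["id", "name"], treat_as_body=False))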

  5. Once the preview is started, the unique preview-id will be generated for it. The runtime information (preview-id, STATUS) for the preview will be generated and stored (in memory or on disk).

  6. Once the preview execution is complete, its runtime information will be updated with the status of the preview (COMPLETED or FAILED).

  7. To get the status of the preview
    Request Method and Endpoint

    Code Block
    GET /v3/namespaces/{namespace-id}/apps/{app-id}/preview/status
    where namespace-id is the name of the namespace
          app-id is the name of the application for which preview data is to be requested

    Response body will contain JSON encoded preview status and optional message if the preview failed.

    Code Block
    1. If preview is RUNNING
    {
    	"status": "RUNNING" 
    }
    
    2. If preview is COMPLETED
    {
    	"status": "COMPLETED" 
    }
    
    3. If preview FAILED
    {
    	"status": "FAILED",
        "errorMessage": "Preview failure root cause message." 
    }
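    A client typically polls this status endpoint until the preview reaches a terminal state. A minimal polling sketch follows; the host, namespace, and application name are assumed values.

    Code Block
    import time
    import requests

    CDAP = "http://localhost:11015"   # assumed
    STATUS_URL = f"{CDAP}/v3/namespaces/default/apps/MyPipeline/preview/status"

    def wait_for_preview(timeout_seconds: int = 300, poll_seconds: int = 5) -> dict:
        """Poll the preview status until it is no longer RUNNING."""
        deadline = time.time() + timeout_seconds
        while time.time() < deadline:
            status = requests.get(STATUS_URL).json()
            if status.get("status") != "RUNNING":
                return status   # e.g. {"status": "COMPLETED"} or {"status": "FAILED", ...}
            time.sleep(poll_seconds)
        raise TimeoutError("preview did not finish in time")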

    To get the preview data for stage:
    Request Method and Endpoint

    Code Block
    GET /v3/namespaces/{namespace-id}/apps/{app-id}/preview/stages/{stage-name}
    where namespace-id is the name of the namespace
          app-id is the name of the application for which preview data is to be requested
          stage-name is the unique name used to identify the stage

    Response body will contain JSON encoded input data and output data for the stage as well as input and output schema.


    To get the status of the preview using the preview-id:
    Request Method and Endpoint

    Code Block
    GET /v3/namespaces/{namespace-id}/previews/{preview-id}/status
    where namespace-id is the name of the namespace
    	  preview-id is the id of the preview for which status is to be requested

    Response body will contain JSON encoded preview status and optional message if the preview failed.

    Code Block
    1. If preview is RUNNING
    {
    	"status": "RUNNING" 
    }
    
    2. If preview is COMPLETED
    {
    	"status": "COMPLETED" 
    }
    
    3. If preview application deployment FAILED
    {
    	"status": "DEPLOY_FAILED",
    	"failureMessage": "Exception message explaining the failure"
    }
     
    4. If preview application FAILED during execution of the stages
    {
    	"status": "RUNTIME_FAILED",
        "failureMessage": "Failure message"
        /*
        "stages": {
    		[
    			"stage_1": {
    				"numOfInputRecords": 10,
    				"numOfOutputRecords": 10
    			},
    			"stage_2": {
    				"numOfInputRecords": 10,
    				"numOfOutputRecords": 7
    			},
    			"stage_3": {
    				"numOfInputRecords": 7,
    				"numOfOutputRecords": 4,
    				"errorMessage": "Failure reason for the stage"
    			}
    		]
    	} */
    }
  8. To get the preview data for stage:
    Request Method and Endpoint

    Code Block
    GET /v3/namespaces/{namespace-id}/previews/{preview-id}/stages/{stage-id}
    where namespace-id is the name of the namespace
    	  preview-id is the id of the preview for which data is to be requested
          stage-id is the unique name used to identify the emitter

    Response body will contain JSON encoded input data and output data for the emitter as well as input and output schema.

    Code Block
    {
    	"inputData": [
                     	{"first_name": "rob", "zipcode": 95131},
                        {"first_name": "bob", "zipcode": 95054},
                        {"first_name": "tom", "zipcode": 94306}
                     ],
    	"outputData":[
    					{"name": "rob", "zipcode": 95131, "age": 21},
                        {"name": "bob", "zipcode": 95054, "age": 22},
                        {"name": "tom", "zipcode": 94306, "age": 23}
    				 ],
    	"inputSchema": {
    						"type":"record",
    						"name":"etlSchemaBody",
    						"fields":[
    									{"name":"first_name", "type":"string"},
    									{"name":"zipcode", "type":"int"}
    								 ]
    					},
    	"outputSchema": {
    						"type":"record",
    						"name":"etlSchemaBody",
    						"fields":[
    									{"name":"name", "type":"string"},
    									{"name":"zipcode", "type":"int"},
    									{"name":"age", "type":"int"}
    								 ]
    					},
        "errorRecordSchema": {
    						"type":"record",
    						"name":"schemaBody",
    						"fields":[
    									{"name":"errCode", "type":"int"},
    									{"name":"errMsg", "type":"String"},
    									{"name":"invalidRecord", "type":"String"}
    								 ]
    					},
    	"errorRecords": [  
          {  
             "errCode":12,
             "errMsg":"Invalid record",
             "invalidRecord":"{\"offset\":3,\"body\":\"This error record is not comma separated!\"}"
          }
       ],	
     
        // Logs per stage may not make sense will need to think more about it.
    	"logs" : {
    		"stage level logs"
    	}
    }
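    Once the preview completes, a client can fetch this per-stage endpoint and inspect the input, output, and error records. The sketch below assumes a preview id obtained when the preview was started, plus the usual assumed host and namespace.

    Code Block
    import requests

    CDAP = "http://localhost:11015"        # assumed
    NAMESPACE = "default"                  # assumed
    PREVIEW_ID = "some-preview-id"         # returned when the preview was started
    STAGE = "MyCSVParser"

    url = f"{CDAP}/v3/namespaces/{NAMESPACE}/previews/{PREVIEW_ID}/stages/{STAGE}"
    stage = requests.get(url).json()

    print("records in :", len(stage.get("inputData", [])))
    print("records out:", len(stage.get("outputData", [])))
    for err in stage.get("errorRecords", []):
        print("error:", err.get("errCode"), err.get("errMsg"))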
  9. To get the logs/metrics for the preview:
    Request Method and Endpoint

    Code Block
    GET /v3/namespaces/{namespace-id}/previews/{preview-id}/logs
    where namespace-id is the name of the namespace
    	  preview-id is the id of the preview for which data is to be requested
    The logs endpoint returns the entire preview logs.
    A sample response for the logs endpoint is shown below. Note that it is similar to that of a regular app.
    [  
       {  
          "log":"This is sample log - 0",
          "offset":"0.1466470037680"
       },
       {  
          "log":"This is sample log - 1",
          "offset":"1.1466470037680"
       },
       {  
          "log":"This is sample log - 2",
          "offset":"2.1466470037680"
       },
       {  
          "log":"This is sample log - 3",
          "offset":"3.1466470037680"
       },
       {  
          "log":"This is sample log - 4",
          "offset":"4.1466470037680"
       }
    ]
     
    GET /v3/namespaces/{namespace-id}/previews/{preview-id}/metrics
    where namespace-id is the name of the namespace
    	  preview-id is the id of the preview for which data is to be requested
     
    Sample response for the metrics. Note that it is similar to the regular app.
    {
        "endTime": 1466469538,
        "resolution": "2147483647s",
        "series": [
            {
                "data": [
                    {
                        "time": 0,
                        "value": 4
                    }
                ],
                "grouping": {},
                "metricName": "user.Projection.records.out"
            },
            {
                "data": [
                    {
                        "time": 0,
                        "value": 4
                    }
                ],
                "grouping": {},
                "metricName": "user.Projection.records.in"
            },
            {
                "data": [
                    {
                        "time": 0,
                        "value": 4
                    }
                ],
                "grouping": {},
                "metricName": "user.Stream.records.out"
            },
            {
                "data": [
                    {
                        "time": 0,
                        "value": 4
                    }
                ],
                "grouping": {},
                "metricName": "user.JavaScript.records.in"
            },
            {
                "data": [
                    {
                        "time": 0,
                        "value": 4
                    }
                ],
                "grouping": {},
                "metricName": "user.Stream.records.in"
            }
        ],
        "startTime": 0
    }

    To get the logs/metrics for the preview:
    Request Method and Endpoint

    Code Block
    GET /v3/namespaces/{namespace-id}/apps/{app-id}/preview/logs
    GET /v3/namespaces/{namespace-id}/apps/{app-id}/preview/metric
    where namespace-id is the name of the namespace
          app-id is the name of the application for which preview data is to be requested

    Response would be similar to the regular app.
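    The metric names in the sample response encode the stage name (user.<stage>.records.in / records.out), so per-stage record counts can be pulled out of the series directly. The aggregation sketch below assumes a response shaped like the sample above.

    Code Block
    from collections import defaultdict
    from typing import Dict

    def records_per_stage(metrics_response: Dict) -> Dict[str, Dict[str, int]]:
        """Collect records.in / records.out totals per stage from a metrics query response."""
        totals: Dict[str, Dict[str, int]] = defaultdict(lambda: {"in": 0, "out": 0})
        for series in metrics_response.get("series", []):
            name = series["metricName"]            # e.g. "user.Projection.records.out"
            if not name.startswith("user.") or ".records." not in name:
                continue
            stage, direction = name[len("user."):].rsplit(".records.", 1)
            totals[stage][direction] += sum(point["value"] for point in series["data"])
        return dict(totals)

    sample = {"series": [
        {"metricName": "user.Projection.records.out", "data": [{"time": 0, "value": 4}]},
        {"metricName": "user.Projection.records.in", "data": [{"time": 0, "value": 4}]},
    ]}
    print(records_per_stage(sample))   # {'Projection': {'in': 4, 'out': 4}}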


Open Questions:

  1. How to make it easy for the user to upload the CSV file?
  2. Lookup data is a user dataset. Should we allow mocking of the lookup dataset as well?