...

Each action below is described with its purpose, its required properties, and a JSON example.
FileCopy/FileMove Action

Description: Copy files from FTP/SFTP to a Unix machine, from a Unix machine to the HDFS cluster, or from FTP to HDFS; and move files from a source location A (HDFS/Unix machine/SFTP) to a destination location B (HDFS/Unix machine/SFTP).

Properties required:
  • Source A (SSH into, see properties)
  • Source B
    • hostname
    • login credentials
    • /PATH to file

JSON Example:
{		
    ... 
    "config":{
        "connections":[
         {
            "from": "Copy-File",
            "to": "*other end*"
         },
         ...
        ],
        "stages":[
         {
            "name":"SCP-Copy-File",
            "plugin":{
               "name":"File-Manipulation",
               "type":"action",
               "artifact":{
                  "name":"core-action-plugins",
                  "version":"1.4.0-SNAPSHOT",
                  "scope":"SYSTEM"
               },
               "properties":{
	          "source-host": "hostA.com",
"source-login": "username",
"destination-host": "hostB.com",
"destination-login": "username",
"destination-file-path": "/filepath"
               }
            }
         },
         ...
        ]
    }
}
SSH Script Action

Description: Execute Perl/Python/R/Shell/Revo R/Hive Query scripts located on the remote machine.

Properties required:
  • hostname
  • login credentials (keyfiles?)
  • /PATH to script
  • arguments for script

JSON Example:
{		
    ... 
    "config":{
        "connections":[
         {
            "from": "Copy-File",
            "to": "*other end*"
         },
         ...
        ],
        "stages":[
         {
            "name":"Run-Remote-Script",
            "plugin":{
               "name":"SSHShell",
               "type":"action",
               "artifact":{
                  "name":"core-action-plugins",
                  "version":"1.4.0-SNAPSHOT",
                  "scope":"SYSTEM"
               },
               "properties":{
	          "host": "scripthostexample.com",
                  "script-path":"/tmp/myscript",
"login": "username",
"arguments": "{\"name"\":\"timeout\",\"value\:\"10\"},{\"name\":\"user\",\"value\":\"some_user\"}"
               }
            }
         },
         ...
        ]
    }
}
SQL Action

Description: Run SQL stored procedures located on the remote SQL Server, and copy Hive data to a SQL Server table.

Properties required:
  • username
  • database name
  • file to push

JSON Example: similar to the SSH Script Action properties (see the sketch below).
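
The original does not give a concrete JSON example for the SQL Action; the snippet below is only an illustrative sketch modeled on the other action stages above. The stage name "Run-Stored-Procedure" and the property keys (host, username, database, sql-file-path) are assumed for illustration, not a defined plugin contract.

{
    ...
    "config":{
        "stages":[
         {
            "name":"Run-Stored-Procedure",
            "plugin":{
               "name":"SQL",
               "type":"action",
               "artifact":{
                  "name":"core-action-plugins",
                  "version":"1.4.0-SNAPSHOT",
                  "scope":"SYSTEM"
               },
               "properties":{
                  "host": "sqlserver.example.com",
                  "username": "username",
                  "database": "mydb",
                  "sql-file-path": "/tmp/stored_procedure.sql"
               }
            }
         },
         ...
        ]
    }
}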
Email Action

Description: Send emails.

Properties required:
  • recipient/sender email
  • message
  • subject
  • username/password auth (if needed)
  • protocol (smtp, smtps, tls)
  • smtp host
  • smtp port

JSON Example:
{		
    ... 
    "config":{
        "connections":[
         {
            "from": "Copy-File",
            "to": "*other end*"
         },
         ...
        ],
        "stages":[
         {
            "name":"Email-Bob",
            "plugin":{
               "name":"Email",
               "type":"action",
               "artifact":{
                  "name":"core-action-plugins",
                  "version":"1.4.0-SNAPSHOT",
                  "scope":"SYSTEM"
               },
               "properties":{
	          "protocol": "smtp",
"recipient": "bob@caskbob@example.cocom",
"subject": "PR Review",
"username": "username",
"password": "emailpass",
"message": "Hey Bob, could you take a look at this PR?"
               }
            }
         },
         ...
        ]
    }
} 

...

  1. Consider the sample pipeline containing all the action nodes. 

    Code Block
    languagejava
    (Script)----->(CopyToHDFS)------->(Hive)------->(SQL Exporter)
                                         |
                                         |
                                          ------>(HDFSArchive)
     
    Where:
    1. The Script action is responsible for executing a (Shell, Perl, Python, etc.) script located on the specified machine. This action prepares the input data, which arrives in multiple formats (JSON, binary, CSV, etc.), into the single format (flattened records) expected by the next stages.
    2. The CopyToHDFS action is responsible for copying the flattened files generated in the previous stage to the specified HDFS directory.
    3. The Hive action is responsible for executing the Hive script, which populates the Hive tables based on the business logic contained in the script.
    4. The SQL Exporter action exports the Hive table to the relational database.
    5. In parallel, the HDFS files generated during step 2 are archived by the HDFSArchive action.
  2. A possible configuration for the pipeline:

    Code Block
    languagejava
    {		
        "artifact":{
          "name":"cdap-data-pipeline",
          "version":"3.5.0-SNAPSHOT",
          "scope":"SYSTEM"
        },
        "name":"MyActionPipeline",  
        "config":{
            "connections":[
             {
                "from":"Script",
                "to":"CopyToHDFS"
             },
             {
                "from":"CopyToHDFS",
                "to":"Hive"
             },
             {
                "from":"Hive",
                "to":"SQLExporter"
             },
             {
                "from":"Hive",
                "to":"HDFSArchive"
             }
            ],
            "stages":[
             {
                "name":"Script",
                "plugin":{
                   "name":"SSH",
                   "type":"action",
                   "artifact":{
                      "name":"core-action-plugins",
                      "version":"1.4.0-SNAPSHOT",
                      "scope":"SYSTEM"
                   },
                   "properties":{
    				  "host": "scripthost.com",
                      "scriptFileName":"/tmp/myfile",
    				  "command": "/bin/bash",
    				  "arguments": [
    								 { "name": "timeout", "value": 10 },
    								 { "name": "user", "value": "some_user" }
    				  			   ]			
                   }
                }
             },
             {
                "name":"CopyToHDFS",
                "plugin":{
                   "name":"FileCopy",
                   "type":"action",
                   "artifact":{
                      "name":"core-action-plugins",
                      "version":"1.4.0-SNAPSHOT",
                      "scope":"SYSTEM"
                   },
                   "properties":{
    				  "sourceHost": "source.host.com",
                      "sourceLocation":"/tmp/inputDir",
    				  "wildcard": "*.txt",
    				  "targetLocation": "hdfs://hdfs.cluster.example.com/tmp/output"	
                   }
                }
             },
             {
                "name":"Hive",
                "plugin":{
                   "name":"HIVE",
                   "type":"action",
                   "artifact":{
                      "name":"core-action-plugins",
                      "version":"1.4.0-SNAPSHOT",
                      "scope":"SYSTEM"
                   },
                   "properties":{
    				 "hiveScriptURL": "URL of the hive script",
    				 "blocking": "true",
    				 "arguments": [
    								 { "name": "timeout", "value": 10 },
    								 { "name": "user", "value": "some_user" }
    				  			   ]				 
                   }
                }
             },
             {
                "name":"SQLExporter",
                "plugin":{
                   "name":"SQL",
                   "type":"action",
                   "artifact":{
                      "name":"core-action-plugins",
                      "version":"1.4.0-SNAPSHOT",
                      "scope":"SYSTEM"
                   },
                   "properties":{
    				 "connection": "Connection configurations",
    				 "blocking": "true",
    				 "sqlFileName": "/home/user/sql_exporter.sql",
    				 "arguments": [
    								 { "name": "timeout", "value": 10 },
    								 { "name": "user", "value": "some_user" }
    				  			   ]				 
                   }
                }
             },
            {
                "name":"HDFSArchive",
                "plugin":{
                   "name":"FileMove",
                   "type":"action",
                   "artifact":{
                      "name":"core-action-plugins",
                      "version":"1.4.0-SNAPSHOT",
                      "scope":"SYSTEM"
                   },
                   "properties":{
    				  "sourceLocation": "hdfs://hdfs.cluster.example.com/data/output",
    				  "targetLocation": "hdfs://hdfs.cluster.example.com/data/archive/output",
    				  "wildcard": "*.txt"	
                   }
                }
             }
            ]
        }
    }
    
    

     

  3. Based on the connection information specified in the above application configuration, DataPipelineApp will configure the Workflow so that it contains one custom action for each plugin of type "action" in the config. A rough sketch of what such a generated Workflow could look like is shown below.
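
    The sketch below is illustrative only: it assumes the CDAP workflow APIs for custom actions (AbstractWorkflow, AbstractCustomAction, addAction, fork()/also()/join()); the PluginAction wrapper class, its wiring to the plugin stages, and the class names are assumptions rather than the actual DataPipelineApp implementation.

    Code Block
    languagejava
    import co.cask.cdap.api.customaction.AbstractCustomAction;
    import co.cask.cdap.api.workflow.AbstractWorkflow;

    // Hypothetical wrapper that runs a single action-plugin stage by name.
    // The real DataPipelineApp would instantiate and invoke the configured
    // plugin (SSH, FileCopy, HIVE, ...) inside run().
    class PluginAction extends AbstractCustomAction {
      PluginAction(String stageName) {
        super(stageName);
      }

      @Override
      public void run() throws Exception {
        // execute the underlying action plugin for this stage
      }
    }

    // Sketch of a Workflow that could be generated for MyActionPipeline:
    // the linear chain Script -> CopyToHDFS -> Hive, followed by a fork that
    // runs SQLExporter and HDFSArchive in parallel.
    public class MyActionPipelineWorkflow extends AbstractWorkflow {
      @Override
      public void configure() {
        setName("MyActionPipelineWorkflow");
        addAction(new PluginAction("Script"));
        addAction(new PluginAction("CopyToHDFS"));
        addAction(new PluginAction("Hive"));
        fork()
          .addAction(new PluginAction("SQLExporter"))
        .also()
          .addAction(new PluginAction("HDFSArchive"))
        .join();
      }
    }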

...