Introduction


The Run plugin allows a user to run any executable binary that is installed and available on all Hadoop nodes. The user's executable processes each input record and returns an output record, which is then processed further downstream in the pipeline.

Use-Case


Enterprises often have existing tools or systems that perform complex transformations of data. These tools are time-tested and have been running in production for a long time. As more and more processing moves to Hadoop, users would like to transition gradually to running on Hadoop. In this case, they would like to run the tools as is or with minor modifications. They have the tool or binary installed on all Hadoop nodes, and they would like to pass each record being processed into the tool and retrieve the results back into the pipeline.

User Stories

  • User should be able to specify the full path to the binary or just the binary name
  • User should be able to specify the arguments for the binary to be executed
  • User should be provided a specification of how the record is passed to the binary (needs to be designed)
  • If the binary executable does not exist, is not in the path, or is not executable, user should be notified appropriately during runtime
  • User is able to see the errors in log if the executable writes the errors to STDERR
  • User executable is able to read the record from STDIN
  • User executable is able to write the record to STDOUT
  • User will make sure the binary and its dependencies are available on all machines of the cluster; no capability needs to be added to the plugin for marshaling the executable
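
The STDIN and STDOUT stories above could be met by an executable as small as the following sketch. It assumes, purely for illustration, that each record arrives as one line of space-separated values and that one output line is expected per record. The names `transform_line` and `run` and the upper-casing logic are hypothetical and not part of the plugin.

```python
import io
import sys


def transform_line(line: str) -> str:
    """Hypothetical transformation: upper-case every space-separated value."""
    return " ".join(value.upper() for value in line.split())


def run(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr) -> None:
    """Read one record per line from STDIN and write one result per line to STDOUT."""
    for raw in stdin:
        line = raw.rstrip("\n")
        if not line:
            # Report bad input on STDERR so that it can show up in the logs.
            print("empty record received", file=stderr)
            continue
        print(transform_line(line), file=stdout)
        stdout.flush()  # do not hold results back in a buffer


if __name__ == "__main__":
    # Demo with in-memory streams; a real executable would simply call run().
    demo_out = io.StringIO()
    run(io.StringIO("john doe\n"), demo_out)
    print(demo_out.getvalue(), end="")
```

Calling `run()` with no arguments wires the same loop to the real process streams, which is how such a script would be used as the binary.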

Conditions

  • The binary and its dependencies must be available on all the machines of the cluster prior to the execution of the binary.
  • Arguments should be in the proper sequence and in the supported format. Any mismatch in the sequence of the arguments will result in the failure of the execution.
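
The availability condition can be checked on a node before a pipeline is run. The following is a minimal sketch, not the plugin's own validation: it inspects only the first token of the command (for `java -jar ...` that is `java`, not the jar), and the function name `find_executable` is hypothetical.

```python
import shlex
import shutil
import sys


def find_executable(command: str) -> str:
    """Return the resolved path of the first token of `command`, or raise."""
    tokens = shlex.split(command)
    if not tokens:
        raise ValueError("command is empty")
    binary = tokens[0]
    if "://" in binary:
        # URIs such as hdfs:// or file:/// are not local filesystem paths.
        raise ValueError(f"expected a local path, got a URI: {binary}")
    # shutil.which handles both bare names (looked up on PATH) and full paths,
    # and returns None when the file is missing or not executable.
    resolved = shutil.which(binary)
    if resolved is None:
        raise FileNotFoundError(
            f"{binary} does not exist, is not on PATH, or is not executable"
        )
    return resolved


if __name__ == "__main__":
    # The running Python interpreter serves as a binary that is known to exist.
    print(find_executable(shlex.quote(sys.executable)))
```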

Design


Design Approach Assumptions/Considerations:

  1. Types of binary executable that will be supported by the plugin are: "bat, jar, sh and exe".
  2. The path to the executable binary, specified in the 'commandToExecute' property, should be an absolute path and not a URI, i.e. it should not start with hdfs:// or file:///.

  3. The executable binary will always read its input through STDIN and should write its output to STDOUT for each input record. Errors emitted by the executable through STDERR will be captured in the logs.

  4. The executable binary can take 0 to N inputs. The source for the varying inputs will always be the structured records coming through the Hydrator source stage, and they will be passed to the binary through STDIN. Required fields can be provided using the 'fieldsToProcess' property.

  5. Fixed inputs (if any) will always be followed by the varying inputs. All the inputs will be passed as a space-separated sequence to the executable binary through STDIN. This will be the format for sending the inputs to the executable binary.

  6. The output of the binary execution will be stored in the target/output field, which will be provided by the user. The final output will include the output field as well as the input fields coming from the previous stage.
  7. If the binary does not produce any output, the plugin will write an empty string to the target/output field.
  8. If the binary does not exist, the pipeline will fail at runtime (after the pipeline is published).
  9. Supported schema types for the output field are: "boolean, bytes, double, float, int, long and string".
  10. The plugin will read the standard output and error streams with UTF-8 encoding.
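
The flow described in the list above (fixed inputs followed by varying inputs, sent through STDIN, with the output stored in a target field) can be sketched as follows. This is an illustration under assumptions, not the plugin's implementation: it starts one process per record, which the design does not specify, and the names `build_input` and `run_for_record` are hypothetical.

```python
import shlex
import subprocess
import sys


def build_input(record: dict, fields_to_process: list, fixed_arguments: str) -> str:
    """Fixed inputs first, then the selected record fields, space separated."""
    fixed = fixed_arguments.split()
    varying = [str(record[name]) for name in fields_to_process]
    return " ".join(fixed + varying)


def run_for_record(command, record, fields_to_process, fixed_arguments, output_field):
    """Run `command` once, feed it one input line, and add its output to the record."""
    completed = subprocess.run(
        shlex.split(command),
        input=build_input(record, fields_to_process, fixed_arguments) + "\n",
        capture_output=True,
        encoding="utf-8",  # both streams are decoded as UTF-8
    )
    if completed.stderr:
        # Stand-in for the plugin log: forward whatever the binary wrote to STDERR.
        print(completed.stderr, file=sys.stderr, end="")
    result = dict(record)  # input fields are kept in the output record
    # A binary that prints nothing yields an empty string here.
    result[output_field] = completed.stdout.strip()
    return result
```

With a record `{"Firstname": "john", "Lastname": "doe"}`, fields `["Firstname", "Lastname"]` and fixed arguments `"256 1024"`, the line sent to the binary would be `256 1024 john doe`.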

Run Plugin Properties:

  • commandToExecute : Command that will contain the full path to the executable binary present on the local filesystem of the Hadoop nodes as well as how to execute that binary. For example, java -jar /home/user/ExampleRunner.jar, if the binary to be executed is of type jar.
  • fieldsToProcess : A comma-separated sequence of the fields to be used as the variable inputs for the binary to be executed. For example, 'firstname' or 'firstname,lastname' in case of multiple inputs. Please make sure that the fields are in the order expected by the binary. (Macro enabled)
  • fixedArguments : A space-separated sequence of the fixed inputs that will be passed to the executable binary. Please make sure that the arguments are in the order expected by the binary. All the fixed inputs will be followed by the variable inputs. (Macro enabled)
  • outputField : The field name that holds the output of the executable binary.
  • outputFieldType : Schema type of the 'Output Field'. Supported types are: boolean, bytes, double, float, int, long and string.
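
Because the binary's output arrives as text, it has to be converted to the type named in 'outputFieldType'. The sketch below shows one possible mapping of the seven supported types. The parsing rules are assumptions (for example, treating only a case-insensitive "true" as boolean true), and the function name `convert_output` is hypothetical.

```python
def convert_output(raw: str, output_field_type: str):
    """Convert the text read from STDOUT to the configured schema type."""
    converters = {
        "boolean": lambda s: s.strip().lower() == "true",
        "bytes": lambda s: s.encode("utf-8"),
        "double": float,
        "float": float,
        "int": int,
        "long": int,
        "string": str,
    }
    if output_field_type not in converters:
        raise ValueError(f"unsupported output field type: {output_field_type}")
    # Numeric conversions raise ValueError on text that is not a number,
    # including the empty string written when the binary prints nothing.
    return converters[output_field_type](raw)
```

How an empty output should map to a numeric or boolean field is left open by the design, so the sketch lets the conversion error surface rather than picking a default.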

Run Input JSON Format:

{
  "name": "Run",
  "type": "transform",
  "properties": {
    "commandToExecute": "java -jar /opt/cdap/Runner.jar",
    "fieldsToProcess": "Firstname,Lastname",
    "fixedArguments": "256 1024 -Dcheckstyle.skip=true",
    "outputField": "FinalOutput:target",
    "outputFieldType": "string"
  }
}
Note: More details will be added based on the findings.

Implementation Tips


Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature