...

  • User is able to specify the input field that should be considered as the source of the XML event or record.
  • User is able to specify the XML encoding (default is UTF-8).
  • The plugin should ignore comments in the XML.
  • User is able to specify a collection of XPath to output field name mappings.
    • User is able to extract values from attributes (as supported by XPath).
    • User is NOT able to specify XPaths that evaluate to arrays. Doing so should be a runtime error.
  • User is able to specify the output field types, and the plugin performs the appropriate conversions.
  • User is able to specify what should happen when there is an error in processing.
    • User can specify that the error record should be ignored.
    • User can specify that processing should stop upon error.
    • User can specify that all error records should be written to a separate dataset.

 

Design

Approaches:

  • To fetch the output field name, its type and its xPath from the user.

As an alternative to using the output schema in the right panel to fetch the output field names and their types from the user, the following approaches can be used to read this information through widgets.

First Approach:

Use of a 3-field widget: “two text boxes and one dropdown”

Textbox           | Textbox | Dropdown
output field name | xPath   | Output field type

In this case, the dropdown will be populated with the supported data types, and the user will select one of them. This ensures that a valid data type is entered.

Note: The GroupbyAggregate plugin already has a widget with two text boxes and a dropdown. It could be reused as follows:

1st text box - xpathMappings

Dropdown - field type

2nd text box - fieldName

However, the keyword "as" shown in that widget may be confusing to the user.


Or

Second Approach (Implemented):

Use two widgets as mentioned below:

1.  keyvalue(xpathMappings): to get the output field name and its xPath.

Textbox           | Textbox
output field name | xPath

...

2.  keyvalue-dropdown(fieldTypeMapping): to get the output field name and its type. The type dropdown will have the following options: boolean, int, long, float, double, bytes, string

(The schema widget is not visible in the "configuration-groups" on the UI.)

Note: Both config properties, xpathMappings and schema, will be mandatory.

...

 Or

Third Approach:

Use three text box widgets, which we already have.

Textbox           | Textbox | Textbox
output field name | xPath   | Output field type

In this case, we will need to add validation when the pipeline is configured, to make sure that the user has entered valid data types.

The property description will also list the supported data types, which will help the user enter the expected type for each field.

Note: This approach requires extra validation to check that a correct value has been entered for each field type.
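The extra validation that the third approach calls for could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical, the list of supported types is taken from the dropdown options of the second approach, and trimming and lower-casing the user's text is an assumption rather than a stated requirement.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of the plugin): validates a free-text field type.
class FieldTypeValidator {

  // Types listed for the dropdown in the second approach.
  static final Set<String> SUPPORTED_TYPES = new LinkedHashSet<>(Arrays.asList(
      "boolean", "int", "long", "float", "double", "bytes", "string"));

  // Returns the normalized type name, or throws if the type is not supported.
  // Assumption: surrounding whitespace and letter case are ignored.
  static String validate(String fieldName, String typeText) {
    String normalized = typeText == null ? "" : typeText.trim().toLowerCase(Locale.ROOT);
    if (!SUPPORTED_TYPES.contains(normalized)) {
      throw new IllegalArgumentException(
          "Unsupported type '" + typeText + "' for field '" + fieldName
              + "'. Supported types: " + SUPPORTED_TYPES);
    }
    return normalized;
  }

  public static void main(String[] args) {
    System.out.println(validate("price", " Double "));
    try {
      validate("year", "integer");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

A check of this kind would have to run for every entry of the third text box, which is the extra cost of this approach compared with the dropdown.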

 

  • If the XPath evaluates to an array for an XML record, the plugin will throw a runtime exception. E.g., for the XML record below, the XPath /Cities/City results in a runtime exception, which will be handled in the way specified by the user.

    <Cities>
        <City>Paris</City>
        <City>Lyon</City>
        <City>Marseille</City>
    </Cities>
  • If the XPath evaluates to a node that contains child nodes, the plugin will return the node with all its children as an XML string.
  • If the user chooses to write errors to a dataset, the error record will be emitted using the emitter.emitError() available in transform().
  • The field names specified in "xpathMappings" and "fieldTypeMapping" should match each other exactly. (This will be validated during configurePipeline().)
  • The schema for all fields will be of type nullable, since an XPath can evaluate to null if no node satisfies the expression.
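The XPath rules above (runtime error for arrays, XML string for nodes with children, null when nothing matches) can be sketched with the standard javax.xml.xpath API. This is a minimal illustration under assumptions, not the plugin's actual code: the class name, the method names and the choice of exception types are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of the evaluation rules described in the bullets above.
class XPathEvaluationSketch {

  // Returns null for no match, the text for a leaf or attribute,
  // the serialized XML for a node with child elements, and throws for arrays.
  static String evaluate(String xml, String xpath) {
    NodeList nodes = select(xml, xpath);
    if (nodes.getLength() == 0) {
      return null; // this is why every output field is nullable
    }
    if (nodes.getLength() > 1) {
      throw new IllegalArgumentException(
          "XPath " + xpath + " matched " + nodes.getLength() + " nodes; arrays are not supported");
    }
    Node node = nodes.item(0);
    return hasChildElement(node) ? serialize(node) : node.getTextContent();
  }

  private static NodeList select(String xml, String xpath) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return (NodeList) XPathFactory.newInstance().newXPath()
          .evaluate(xpath, doc, XPathConstants.NODESET);
    } catch (Exception e) {
      throw new IllegalStateException("Cannot evaluate XPath " + xpath, e);
    }
  }

  private static boolean hasChildElement(Node node) {
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
        return true;
      }
    }
    return false;
  }

  private static String serialize(Node node) {
    try {
      Transformer transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      transformer.transform(new DOMSource(node), new StreamResult(out));
      return out.toString();
    } catch (TransformerException e) {
      throw new IllegalStateException("Cannot serialize node", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<Cities><City>Paris</City><City>Lyon</City><City>Marseille</City></Cities>";
    System.out.println(evaluate(xml, "/Cities/City[1]"));
    System.out.println(evaluate(xml, "/Cities"));
    System.out.println(evaluate(xml, "/Cities/Town"));
  }
}
```

With the Cities record, /Cities/City matches three nodes and therefore throws, while /Cities/City[1] selects a single leaf node and returns its text.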

Assumptions:

For every structured record received by the transform plugin, the output will also be a single structured record.

Example

Properties:

input                : Specifies the input field that should be considered as the source of the XML event or record.

encoding             : Specifies the XML encoding type (default is UTF-8).

xpathMappings        : Specifies a mapping from each output field name to the XPath evaluated against the XML record.

fieldTypeMapping     : Specifies each field name given in xpathMappings and its corresponding data type. The data type can be one of: boolean, int, long, float, double, bytes, string.

processOnError       : Specifies what happens in case of an error:

                       1. Ignore the error record
                       2. Stop processing upon encountering an error
                       3. Write the error record to a different dataset
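The three processOnError behaviours can be sketched as follows. Everything here is illustrative: the enum constants, the Sink interface and the ListSink class are hypothetical stand-ins, and Sink is deliberately NOT the real pipeline emitter interface; it only mirrors the idea of emitting a record or an error record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the three processOnError options.
class ErrorPolicySketch {

  enum ErrorPolicy { IGNORE, STOP, WRITE_TO_ERROR_DATASET }

  // Stand-in for the pipeline emitter; NOT the real CDAP interface.
  interface Sink {
    void emit(String record);
    void emitError(String record, String reason);
  }

  // Simple in-memory sink used for illustration.
  static class ListSink implements Sink {
    final List<String> records = new ArrayList<>();
    final List<String> errors = new ArrayList<>();

    @Override
    public void emit(String record) {
      records.add(record);
    }

    @Override
    public void emitError(String record, String reason) {
      errors.add(record + " (" + reason + ")");
    }
  }

  // Parses one input record and applies the configured policy if parsing fails.
  static void transform(String input, Function<String, String> parser,
                        ErrorPolicy policy, Sink sink) {
    try {
      sink.emit(parser.apply(input));
    } catch (RuntimeException e) {
      switch (policy) {
        case IGNORE:
          break; // drop the record and continue
        case STOP:
          throw new IllegalStateException("Stopping on error record: " + input, e);
        case WRITE_TO_ERROR_DATASET:
          sink.emitError(input, e.getMessage());
          break;
      }
    }
  }
}
```

In this sketch a failed record either disappears, aborts the run by rethrowing, or is routed to the error output, matching options 1 to 3 in that order.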

Example:

        This example parses the XML record received in the "body" field of the structured record, according to the xpathMappings specified for each field name.

        The output schema will be created using the type specified for each field in "fieldTypeMapping".

{
  "name": "XMLParser",
  "plugin": {
    "name": "XMLParser",
    "type": "transform",
    "label": "XMLParser",
    "properties": {
      "encoding": "UTF-8",
      "processOnError": "Ignore the error record",
      "xpathMappings": "category://book/@category,title://book/title,year:/bookstore/book[price>35.00]/year,price:/bookstore/book[price>35.00]/price,subcategory://book/subcategory",
      "fieldTypeMapping": "category:string,title:string,year:int,price:double,subcategory:string",
      "input": "body"
    }
  }
}
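To make the two mapping strings in this configuration concrete, the sketch below parses them, checks that both name the same fields, and converts an extracted string to the configured type. It is a hedged illustration only: the class and method names are hypothetical, and splitting on commas is a simplifying assumption that would break for XPaths that themselves contain commas.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: parsing and checking xpathMappings and fieldTypeMapping.
class MappingConfigSketch {

  // Parses "name:value,name:value". Each entry is split on its FIRST colon only,
  // because the XPath part may itself contain ':' characters.
  static Map<String, String> parse(String config) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String entry : config.split(",")) {
      int colon = entry.indexOf(':');
      if (colon <= 0) {
        throw new IllegalArgumentException("Malformed entry: " + entry);
      }
      result.put(entry.substring(0, colon).trim(), entry.substring(colon + 1).trim());
    }
    return result;
  }

  // Field names in both properties must match exactly.
  static void validateSameFields(Map<String, String> xpaths, Map<String, String> types) {
    if (!xpaths.keySet().equals(types.keySet())) {
      throw new IllegalArgumentException(
          "Field names differ: " + xpaths.keySet() + " vs " + types.keySet());
    }
  }

  // Converts an extracted value to the configured type; null stays null (nullable schema).
  static Object convert(String value, String type) {
    if (value == null) {
      return null;
    }
    switch (type) {
      case "boolean": return Boolean.parseBoolean(value);
      case "int":     return Integer.parseInt(value);
      case "long":    return Long.parseLong(value);
      case "float":   return Float.parseFloat(value);
      case "double":  return Double.parseDouble(value);
      case "bytes":   return value.getBytes(StandardCharsets.UTF_8);
      case "string":  return value;
      default: throw new IllegalArgumentException("Unsupported type: " + type);
    }
  }

  public static void main(String[] args) {
    Map<String, String> xpaths = parse("category://book/@category,title://book/title");
    Map<String, String> types = parse("category:string,title:string");
    validateSameFields(xpaths, types);
    System.out.println(xpaths);
  }
}
```

Splitting on the first colon is what lets an entry such as category://book/@category resolve to the field name category and the XPath //book/@category.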

Questions/Clarifications

Clarifications:

...