Goal

The XML Parser uses XPath to extract fields from a complex XML event. It is generally used in conjunction with the XML Source Reader: the XML Source Reader provides individual events to the XML Parser, and the XML Parser is responsible for extracting fields from each event and mapping them to the output schema.

A simple example: assume you have an XML event that looks as follows:

<employee>
  <name>
   <first>Joltie</first>
   <last>Root</last>
  </name>
  <address>
   <street> 180, Mars Ave </street>
   <city> Marsville </city>
   <state> Marscity </state>
   <country> M.A.R.S </country>
   <coordinates>
     <lat>89</lat>
     <long>117</long>
   </coordinates>
  </address>
  <dob>
    <day>1</day>
    <month>Jan</month>
    <year>2177</year>
  </dob>
</employee>

The user wants to extract the following fields from the XML event:

first
last
lat
long
dob year

The user specifies the following XPath expressions to extract those fields:

/employee/name/first
/employee/name/last
/employee/address/coordinates/lat
/employee/address/coordinates/long
/employee/dob/year
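
As a rough illustration of the mapping above, the following Python sketch evaluates the same paths against a trimmed copy of the event. It is not the plugin's implementation: it assumes the standard-library xml.etree.ElementTree module, which supports only a subset of XPath, and the names EVENT, XPATH_MAPPINGS and extract_fields are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the employee event from the example above.
EVENT = """<employee>
  <name>
    <first>Joltie</first>
    <last>Root</last>
  </name>
  <address>
    <coordinates>
      <lat>89</lat>
      <long>117</long>
    </coordinates>
  </address>
  <dob>
    <year>2177</year>
  </dob>
</employee>"""

# Hypothetical mapping of output field name -> XPath.
XPATH_MAPPINGS = {
    "first": "/employee/name/first",
    "last": "/employee/name/last",
    "lat": "/employee/address/coordinates/lat",
    "long": "/employee/address/coordinates/long",
    "year": "/employee/dob/year",
}


def extract_fields(xml_text, mappings):
    """Return a dict of output field name -> text of the matched node."""
    root = ET.fromstring(xml_text)
    # ElementTree resolves paths relative to an element, so wrap the root in a
    # synthetic parent to make absolute paths such as /employee/... work.
    doc = ET.Element("document")
    doc.append(root)
    record = {}
    for field, xpath in mappings.items():
        node = doc.find("." + xpath)
        has_text = node is not None and node.text is not None
        record[field] = node.text.strip() if has_text else None
    return record


if __name__ == "__main__":
    print(extract_fields(EVENT, XPATH_MAPPINGS))
```

All extracted values are strings at this point; converting them to the output field types is a separate step.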

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature

Use-case

User should be able to specify the input field that should be considered as the source of the XML event or record.
User is able to specify the XML encoding (default is UTF-8)
The plugin should ignore comments in the XML
User is able to specify a collection of XPath to output field name mappings
- User is able to extract values from attributes (as supported by XPath)
- User is NOT able to specify XPaths that return arrays. This should be a runtime error.
User is able to specify the output field types and the plugin performs appropriate conversions
User is able to specify what should happen when there is an error in processing
- User can specify that the error record should be ignored
- User can specify that upon error it should stop processing
- User can specify that all the error records should be written to a separate dataset
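
The three error-handling options above can be sketched as follows. This is only an illustration in Python under stated assumptions: the OnError enum, the process function and the to_typed converter are hypothetical names, and a failed type conversion stands in for any processing error.

```python
from enum import Enum


class OnError(Enum):
    """Hypothetical names for the three error-handling options."""
    IGNORE = "ignore"
    STOP = "stop"
    WRITE_TO_ERROR_DATASET = "write-to-error-dataset"


def process(records, convert, on_error):
    """Apply convert() to each record and route failures per on_error."""
    output, errors = [], []
    for record in records:
        try:
            output.append(convert(record))
        except ValueError as exc:
            if on_error is OnError.STOP:
                raise  # stop processing on the first error
            if on_error is OnError.WRITE_TO_ERROR_DATASET:
                errors.append((record, str(exc)))
            # OnError.IGNORE: the failing record is dropped
    return output, errors


def to_typed(record):
    # Example conversion: 'lat' is declared as an integer output field.
    return {"lat": int(record["lat"])}


RECORDS = [{"lat": "89"}, {"lat": "north"}]

if __name__ == "__main__":
    print(process(RECORDS, to_typed, OnError.WRITE_TO_ERROR_DATASET))
```

With OnError.IGNORE the second record is dropped, with OnError.WRITE_TO_ERROR_DATASET it is collected together with the error message, and with OnError.STOP the ValueError propagates to the caller.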

Design

Approaches for obtaining the output field name, its type, and its XPath from the user.

As an alternative to using the output schema in the right panel to obtain the output field names and their types from the user, the following approaches can be used to read this information through widgets.

First Approach:

Use a three-field widget: “two text boxes and one dropdown”

Textbox	Textbox	Dropdown
output field name	xPath	Output field type

In this case, the dropdown will be populated with the supported data types, and the user will select one of them. This ensures that a valid data type is entered.

Note : The GroupbyAggregate plugin has a widget with 2 text-boxes and a drop-down.

1st text box - xpathMappings

drop-down - field type

2nd text box - fieldName

However, the keyword "as" shown in that widget may be confusing to the user.

Or

Second Approach (Implemented):

Use two widgets as mentioned below:

1. keyvalue(xpathMappings) : to get the output field name and its xPath.

Textbox	Textbox
output field name	xPath

2. schema : to get the output field name and its type.

Note: Both config properties, xpathMappings and schema, will be mandatory.
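
A minimal sketch of how the two properties could be read and cross-checked is given below. It rests on assumptions not stated above: that the keyvalue widget serializes xpathMappings as "name:xpath,name:xpath", that the schema is available as a mapping of field name to type, and that both properties must describe the same set of fields. The function names are hypothetical.

```python
def parse_xpath_mappings(raw):
    """Parse 'name:xpath,name:xpath' into a dict of field name -> XPath."""
    mappings = {}
    for pair in raw.split(","):
        # Split on the first ':' only, since an XPath may itself contain ':'.
        name, sep, xpath = pair.partition(":")
        name, xpath = name.strip(), xpath.strip()
        if not sep or not name or not xpath:
            raise ValueError(f"Invalid XPath mapping: {pair!r}")
        mappings[name] = xpath
    return mappings


def check_same_fields(mappings, schema):
    """Assumed rule: every mapped field has a type and every typed field an XPath."""
    without_type = sorted(set(mappings) - set(schema))
    without_xpath = sorted(set(schema) - set(mappings))
    if without_type or without_xpath:
        raise ValueError(
            f"Fields without a type: {without_type}; "
            f"fields without an XPath: {without_xpath}"
        )


if __name__ == "__main__":
    parsed = parse_xpath_mappings(
        "first:/employee/name/first,year:/employee/dob/year"
    )
    check_same_fields(parsed, {"first": "string", "year": "int"})
    print(parsed)
```

Under the assumed serialization, an XPath that contains a comma would not survive the split, so a real implementation would need a different delimiter or escaping.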

Or

Third Approach:

Use three text box widgets, which we already have.

Textbox	Textbox	Textbox
output field name	xPath	Output field type

In this case, we will need to add a validation step in the pipeline to make sure that the user has entered valid data types.

Also, the description will list the supported data types, which will help the user enter the expected type for each field.

Note : This will require extra validations to check if the correct value for field type is entered.
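
The extra validation needed for the free-text type field could look roughly like this. The list of supported types is an assumption made for the sake of the example (it is not specified above), and validate_field_types is a hypothetical name.

```python
# Assumed set of supported output types; the actual list is not given here.
SUPPORTED_TYPES = ("boolean", "int", "long", "float", "double", "string")


def validate_field_types(field_types):
    """Normalize user-entered type names and reject unsupported ones."""
    normalized, problems = {}, []
    for name, type_text in field_types.items():
        candidate = type_text.strip().lower()
        if candidate in SUPPORTED_TYPES:
            normalized[name] = candidate
        else:
            problems.append(f"{name}: unsupported type {type_text!r}")
    if problems:
        supported = ", ".join(SUPPORTED_TYPES)
        raise ValueError("; ".join(problems) + f" (supported: {supported})")
    return normalized


if __name__ == "__main__":
    print(validate_field_types({"first": "String", "lat": " int "}))
```

A dropdown, as in the first approach, makes this check unnecessary because the user can only pick a supported value.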

Assumptions:

For every structured record received by the transform plugin, the output will also be a single structured record.

Example

Questions/Clarifications

Clarifications:

To define the output field types, field names, and XPath values, one of the following approaches can be used:
1. A common widget with 2 text boxes and a dropdown, or
2. A key-value widget to take the output field name and XPath expression, and a second output schema widget
User is able to specify what should happen when there is an error in processing. Errors could be:

IllegalCharacter
Type conversion error
NULL or EMPTY value for a non-nullable column

Requirement: User is NOT able to specify XPaths that return arrays. This should be a runtime error.
Understanding: This refers to an XPath that returns multiple nodes with the same name. Suppose we have the input below:
```
<Cities>
    <City>Paris</City>
    <City>Lyon</City>
    <City>Marseille</City>
 </Cities>
```
If the user wants to extract the cities ['Paris', 'Lyon', 'Marseille'] and provides the XPath /Cities/City, then, as per our use case, we should throw an error.
a. Is this understanding correct or are we missing anything on xPath arrays?
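
Under that understanding, the check could be sketched as below: count the nodes an XPath matches and fail when there is more than one. This is a Python illustration using the standard-library xml.etree.ElementTree module, not the plugin's code, and extract_single is a hypothetical name.

```python
import xml.etree.ElementTree as ET

CITIES = """<Cities>
    <City>Paris</City>
    <City>Lyon</City>
    <City>Marseille</City>
</Cities>"""


def extract_single(xml_text, xpath):
    """Return the text of the single node matched by xpath, or None."""
    root = ET.fromstring(xml_text)
    # Wrap the root so that absolute paths such as /Cities/City resolve.
    doc = ET.Element("document")
    doc.append(root)
    matches = doc.findall("." + xpath)
    if len(matches) > 1:
        raise ValueError(
            f"XPath {xpath} matched {len(matches)} nodes; "
            "arrays are not supported"
        )
    return matches[0].text if matches else None


if __name__ == "__main__":
    try:
        extract_single(CITIES, "/Cities/City")
    except ValueError as exc:
        print(exc)
```

An XPath that matches exactly one City element would return its text, and one that matches nothing would return None in this sketch.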
If the user chooses to write the error records to a separate dataset, then the record will be emitted using emitter.emitError() in the transform method.
1. Is the understanding correct?

Questions:

If the XPath evaluates to a node with child elements, how should the plugin handle this?
1. Return the text in the child node elements as comma-separated values
2. Return the value as an XML record, similar to the record emitted by the XMLReader plugin
3. Throw an exception
4. Something other than the above
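
To make the first three options concrete, the sketch below shows what each would produce for a small node with child elements. It does not imply that any option has been chosen; the function names are hypothetical and the code uses Python's standard-library xml.etree.ElementTree module purely for illustration.

```python
import xml.etree.ElementTree as ET


def child_text_csv(node):
    # Option 1: join the non-blank text found under the node with commas.
    return ",".join(t.strip() for t in node.itertext() if t.strip())


def as_xml(node):
    # Option 2: serialize the matched node back to an XML string.
    return ET.tostring(node, encoding="unicode")


def reject_children(node):
    # Option 3: treat a node that has child elements as an error.
    if len(node) > 0:
        raise ValueError(f"<{node.tag}> has child elements")
    return node.text


if __name__ == "__main__":
    name = ET.fromstring("<name><first>Joltie</first><last>Root</last></name>")
    print(child_text_csv(name))
    print(as_xml(name))
```

One thing to weigh for option 1: a value that itself contains a comma, such as the street in the employee example, cannot be told apart from a separator in the joined string.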