Row Denormalizer

Introduction

The Row Denormalizer plugin allows one to de-normalise data along with ability to convert data types.

Use-case

Imagine that a source database has a table that stores a variable set of custom attributes for an entity like the one defined below. This model allows adding any number of attributes to an entity.

Key Field
Field Name
Field Value

Key Field	Field Name	Field Value
joltie	FIRST_NAME	Nitin
joltie	LAST_NAME	Motgi
joltie	ADDRESS	150 Grant Ave, Suite A
joltie	CITY	Palo Alto

But once the data is being passed through the Row Denormaliser it should map to the following making it easy to query the data.

Key	FIRST_NAME	LAST_NAME	addr	CITY
joltie	Nitin	Motgi	150 Grant Ave, Suite A	Palo Alto

Conditions

In case a field value is not present then it’s considered as NULL. For Example,
- If Key Field in the input record is NULL, then that particular row will not be considered.
- If Field Name and Field Value is not present, then denormalized output will have NULL for that field.
- If user provides output field which is not present in the input record, then it will be considered as NULL.
Input record will always have 3 fields and all these fields will be of type String.

Options

User is able to specify the Key Field based on the Input Schema (has to be field in Input Schema). This the key of the output row. From the above example it’s “Key Field”
User is able to specify the list of fields that should be considered to form a denormalized record. From the above example it should be ‘FIRST_NAME’, ‘LAST_NAME’, ‘ADDRESS’ & ‘CITY’
Users are able to specify the output field name for each through mapping. From the above example ‘ADDRESS’ in input is mapped to ‘addr’ in output schema.
Similarly simple type conversions should be attempted - {int, long, float, double} -> string.
Plugin should write dropped records to error dataset.

Design

Assumptions:

Since input record will be of Schema.Type = String, hence output schema will also be of type String.

Examples

Properties:.

keyField: key on the basis of which input record will be denormalized. This field should be included in input schema.
nameField: Name of the field in input record that contains output fields to be included in denormalized output.
valueField: Name of the field in input record that contains values for output fields to be included in denormalized output.
outputFields: List of the output fields to be included in denormalized output.
fieldAliases: List of the output fields to rename. The key specifies the name of the field to rename, with its corresponding value specifying the new name for that field.
errorDataset: Name of the dataset that collects dropped records, when value for the 'Key Field' is null.

Example:

{
  "name": "RowDenormalizer",
  "type": "batchaggregator",
  "properties": {
     "outputFields": "Firstname,Lastname,Address",
     "fieldAliases": "Address:Office Address",
     "keyField": "id",
     "nameField": "attribute",
     "valueField": "value",
     "errorDataset": "dropped-records-table"
   }
}

The transform takes Database table as input record that has a Key Field, Field Name, Field Value input fields specified by user, denormalizes it on the basis of the key field, and then returns a denormalized table according to the output schema specified by the user.

For example, if it receives as an input record:

Key Field	Field Name	Field Value
joltie	FIRST_NAME	Nitin
joltie	LAST_NAME	Motgi
joltie	ADDRESS	150 Grant Ave, Suite A
joltie	CITY	Palo Alto

it will transform it to this output record on the basis of Key Field value "joltie" :

Key	FIRST_NAME	LAST_NAME	Addr	CITY
joltie	Nitin	Motgi	150 Grant Ave, Suite A	Palo Alto

And all the dropped records (if any) will be collected by dataset 'dropped-records table', given as input by user.

Table of Contents

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
~~Examples and guides~~
Integration tests
Documentation for feature
Short video demonstrating the feature