Introduction

Users sometimes need to transform one source row into multiple target rows. For example, attributes are stored in columns in one table or file and may need to be broken into multiple records - one record per column attribute. In general it allows one to convert columns to rows.

Use-Case

Convert wide rows and reducing data to it canonicalize form.
Reduce the need for restructuring the collection, as new types of data are being added.
Let’s assume that we are building a customer 360 Master table that aggregates data for a user from multiple sources. Each of the source has it’s own type of data to be added to a customer id. Instead of creating wide columns, normalization allows one to transform into it’s canonical form and update the customer 360 profile simultaneously from multiple sources.

Use Stories

User should be able to specify a record field mapping. The record field mapping, just maps the field in input schema to the field in the output schema. These are the fields that would be available in every record.
User should be able to specify a normalize field mapping. This will describe the input field name and to what output field it should be mapped to and where the value needs to be added.

Example

Example 1

Customer 360 Use-case

Assume we have a source ‘Customer Profile’ table and ‘Customer Purchase’ table.

Custom Profile Table

CustomerId	First Name	Last Name	Shipping Address	Credit Card	Billing Address	Last Update Date
S23424242	Joltie	Root	32826 Mars Way, Marsville, MR, 24344	2334-232132-2323	32826 Mars Way, Marsville, MR, 24344	05/12/2015
R45764646	Root	Joltie	32423, Your Way, YourVille, YR, 65765	2343-12312-12313	32421, MyVilla Ct, YourVille, YR, 23423	04/03/2012

Build a pipeline that ingest the above data into a Customer 360 Master table that has the following schemas and configuration.

Input Schema

Customer ID
First Name
Last Name
Shipping Address
Credit Card
Billing Address
Last Update Date

Normalize

Record Field Mapping
- Input Field : Customer ID, map to : ID
- Input Field : Last Update Date, map to : Date
Record Field Normalizing
- Input Field : First Name, to : Attribute Type
  - Value to : Attribute Value
- Input Field : Last Name, to : Attribute Type
  - Value to : Attribute Value
- Input Field : Credit Card, to : Attribute Type
  - Value to : Attribute Value
- Input Field : Billing Address, to : Attribute Type
  - Value to : Attribute Value

Output Schema

ID
Attribute Type
Attribute Value
Date

Following is the output from running the pipeline with the configurations specified above. Let’s call this ‘Master Customer 360 Table'

ID	Attribute Type	Attribute Value	Date
S23424242	First Name	Joltie	05/12/2015
S23424242	Last Name	Root	05/12/2015
S23424242	Credit Card	2334-232132-2323	05/12/2015
S23424242	Billing Address	32826 Mars Way, Marsville, MR, 24344	05/12/2015
R45764646	First Name	Root	04/03/2012
R45764646	Last Name	Joltie	04/03/2012
R45764646	Credit Card	2343-12312-12313	04/03/2012
R45764646	Billing Address	32421, MyVilla Ct, YourVille, YR, 23423	04/03/2012

Custom Purchase Table

Customer Id	Item ID	Item Cost	Purchase Date
S23424242	UR-AR-243123-ST	245.67	08/09/2015
S23424242	SKU-234294242942	67.90	10/12/2015
R45764646	SKU-567757543532	14.15	06/09/2014

Now, let’s assume the user configures the pipeline for the above data to be added to ‘Master Customer 360 Table'

Following are schema and configuration

Input Schema

Customer Id
Item ID
Item Cost
Purchase Date

Normalize

Record Field Mapping
- Input Field : Customer ID, to : ID
- Input Field : Purchase Date, to : Date
Normalize Field Mapping
- Input Field : Item ID, to : Attribute Type
  - Value to : Attribute Value
- Input Field : Item Cost, to : Attribute Type
  - Value to : Attribute Value

Output Schema

ID
Attribute Type
Attribute Value
Date

Master Customer 360 Table would be updated as follows

ID	Attribute Type	Attribute Value	Date
S23424242	First Name	Joltie	05/12/2015
S23424242	Last Name	Root	05/12/2015
S23424242	Credit Card	2334-232132-2323	05/12/2015
S23424242	Billing Address	32826 Mars Way, Marsville, MR, 24344	05/12/2015
R45764646	First Name	Root	04/03/2012
R45764646	Last Name	Joltie	04/03/2012
R45764646	Credit Card	2343-12312-12313	04/03/2012
R45764646	Billing Address	32421, MyVilla Ct, YourVille, YR, 23423	04/03/2012
S23424242	Item ID	UR-AR-243123-ST	08/09/2015
S23424242	Item Cost	245.67	08/09/2015
S23424242	Item ID	SKU-234294242942	10/12/2015
S23424242	Item Cost	67.90	10/12/2015
R45764646	Item ID	SKU-567757543532	06/09/2014
R45764646	Item Cost	14.15	06/09/2014

Example 2

Assume we got following data from any relational store as source

Name	Year	Month	Rent	Food	Transportation
Joltie	2016	02	600	100	50
Root	2016	02	900	200	135
Vikram	2016	02	550	145	75
Jon	2016	02	750	175	125
Jockey	2015	12	450	125	45

Configuration is specified as follows

Input Schema

Name, String
Year, Int
Month, Int
Rent, Float
Food, Float
Transportation, Float

Normalize

Record Field Mapping
- Input Field : Name, to : Name
- Input Field : Year, to : Year
- Input Field : Month, to : Month
Record Field Normalizing
- Input Field : Rent, to : ExpenditureType
  - Value to : Expenditure
- Input Field : Food, to : ExpenditureType
  - Value to : Expenditure
- Input Field : Transportation, to: ExpenditureType
  - Value to : Expenditure

Output Schema

Name, String
Year, Int
Month, Int
ExpenditureType, String
Expenditure, Float

Assume we got following data from any relational store as source

Name	Year	Month	ExpenditureType	Expenditure
Joltie	2016	02	Rent	600
Joltie	2016	02	Food	100
Joltie	2016	02	Transportation	50
Root	2016	02	Rent	900
Root	2016	02	Food	200
Root	2016	02	Transportation	135
Vikram	2016	02	Rent	550
Vikram	2016	02	Food	145
Vikram	2016	02	Transportation	75
Jon	2016	02	Rent	750
Jon	2016	02	Food	175
Jon	2016	02	Transportation	125
Jockey	2015	12	Rent	450
Jockey	2015	12	Food	125
Jockey	2015	12	Transportation	45

Design

Output schema will be created on the basis of inputs to fieldMapping and fieldNormalizing fields, input json will look like as below:

     {
        "name": "Normalize",
        "plugin": {
        "name": "Normalize",
        "type": "transform",
        "label": "Normalize",
        "properties": {
           "fieldMapping": "Customer ID:ID,Purchase Date:Date",
           "fieldNormalizing": "Item ID:Attribute Type:Attribute Value,Item Cost:Attribute Type:Attribute Value"
         }
       }
     }

fieldMapping and fieldNormalizing will be mandatory fields
Type of all output schema fields will be STRING
Type and Value mapping for all normalize fields must be same, otherwise it is invalid.
Example: Below mapping shows, Item ID and Item Cost mapped to normalize columns Attribute Type and Attribute Value.
```
"fieldNormalizing": "Item ID:Attribute Type:Attribute Value,Item Cost:Attribute Type:Attribute Value"
```
But below mapping is invalid as Item Cost mapped to different normalize columns.
```
"fieldNormalizing": "Item ID:Attribute Type:Attribute Value,Item Cost:Cost Type:Cost Value"
```

Table of Contents

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature