Normalize
- Nitin Motgi
- Amey Kulkarni
- Bhushan Kawadkar
Owned by Nitin Motgi
Â
Introduction
Users sometimes need to transform one source row into multiple target rows. For example, attributes are stored in columns in one table or file and may need to be broken into multiple records - one record per column attribute. In general it allows one to convert columns to rows.Â
Use-Case
- Convert wide rows and reducing data to it canonicalize form.
- Reduce the need for restructuring the collection, as new types of data are being added.
- Let’s assume that we are building a customer 360 Master table that aggregates data for a user from multiple sources. Each of the source has it’s own type of data to be added to a customer id. Instead of creating wide columns, normalization allows one to transform into it’s canonical form and update the customer 360 profile simultaneously from multiple sources.Â
Use Stories
- User should be able to specify a record field mapping. The record field mapping, just maps the field in input schema to the field in the output schema. These are the fields that would be available in every record.Â
- User should be able to specify a normalize field mapping. This will describe the input field name and to what output field it should be mapped to and where the value needs to be added.Â
Example
Example 1
Customer 360 Use-case
Assume we have a source ‘Customer Profile’ table and ‘Customer Purchase’ table.Â
Custom Profile Table
CustomerId | First Name | Last Name | Shipping Address | Credit Card | Billing Address | Last Update Date |
S23424242 | Joltie | Root | 32826 Mars Way, Marsville, Â MR, 24344 | 2334-232132-2323 | 32826 Mars Way, Marsville, Â MR, 24344 | 05/12/2015 |
R45764646 | Root | Joltie | 32423, Your Way, YourVille, YR, 65765 | 2343-12312-12313 | 32421, MyVilla Ct, YourVille, YR, 23423 | 04/03/2012 |
Build a pipeline that ingest the above data into a Customer 360 Master table that has the following schemas and configuration.
Input Schema
- Customer ID
- First Name
- Last Name
- Shipping Address
- Credit Card
- Billing Address
- Last Update Date
Normalize
- Record Field Mapping
- Input Field : Customer ID, map to : ID
- Input Field : Last Update Date, map to : Date
- Record Field Normalizing
- Input Field : First Name, to : Attribute Type
- Value to : Attribute Value
- Input Field : Last Name, to : Attribute Type
- Value to : Attribute Value
- Input Field :Â Credit Card, to : Attribute Type
- Value to : Attribute Value
- Input Field : Billing Address, to : Attribute Type
- Value to : Attribute Value
- Input Field : First Name, to : Attribute Type
Output Schema
- ID
- Attribute Type
- Attribute Value
- Date
Following is the output from running the pipeline with the configurations specified above. Let’s call this ‘Master Customer 360 Table'
ID | Attribute Type | Attribute Value | Date |
S23424242 | First Name | Joltie | 05/12/2015 |
S23424242 | Last Name | Root | 05/12/2015 |
S23424242 | Credit Card | 2334-232132-2323 | 05/12/2015 |
S23424242 | Billing Address | 32826 Mars Way, Marsville, Â MR, 24344 | 05/12/2015 |
R45764646 | First Name | Root | 04/03/2012 |
R45764646 | Last Name | Joltie | 04/03/2012 |
R45764646 | Credit Card | 2343-12312-12313 | 04/03/2012 |
R45764646 | Billing Address | 32421, MyVilla Ct, YourVille, YR, 23423 | 04/03/2012 |
Custom Purchase Table
Customer Id | Item ID | Item Cost | Purchase Date |
S23424242 | UR-AR-243123-ST | 245.67 | 08/09/2015 |
S23424242 | SKU-234294242942 | 67.90 | 10/12/2015 |
R45764646 | SKU-567757543532 | 14.15 | 06/09/2014 |
Now, let’s assume the user configures the pipeline for the above data to be added to ‘Master Customer 360 Table'
Following are schema and configuration
Input Schema
- Customer Id
- Item ID
- Item Cost
- Purchase Date
Normalize
- Record Field Mapping
- Input Field : Customer ID, to : ID
- Input Field : Purchase Date, to : Date
- Normalize Field Mapping
- Input Field : Item ID, to : Attribute Type
- Value to : Attribute Value
- Input Field : Item Cost, to : Attribute Type
- Value to : Attribute Value
- Input Field : Item ID, to : Attribute Type
Output Schema
- ID
- Attribute Type
- Attribute Value
- Date
Master Customer 360 Table would be updated as follows
ID | Attribute Type | Attribute Value | Date |
S23424242 | First Name | Joltie | 05/12/2015 |
S23424242 | Last Name | Root | 05/12/2015 |
S23424242 | Credit Card | 2334-232132-2323 | 05/12/2015 |
S23424242 | Billing Address | 32826 Mars Way, Marsville, Â MR, 24344 | 05/12/2015 |
R45764646 | First Name | Root | 04/03/2012 |
R45764646 | Last Name | Joltie | 04/03/2012 |
R45764646 | Credit Card | 2343-12312-12313 | 04/03/2012 |
R45764646 | Billing Address | 32421, MyVilla Ct, YourVille, YR, 23423 | 04/03/2012 |
S23424242 | Item ID | UR-AR-243123-ST | 08/09/2015 |
S23424242 | Item Cost | 245.67 | 08/09/2015 |
S23424242 | Item ID | SKU-234294242942 | 10/12/2015 |
S23424242 | Item Cost | 67.90 | 10/12/2015 |
R45764646 | Item ID | SKU-567757543532 | Â 06/09/2014 |
R45764646 | Item Cost | 14.15 | Â 06/09/2014 |
Example 2
Assume we got following data from any relational store as source
Name | Year | Month | Rent | Food | Transportation |
Joltie | 2016 | 02 | 600 | 100 | 50 |
Root | 2016 | 02 | 900 | 200 | 135 |
Vikram | 2016 | 02 | 550 | 145 | 75 |
Jon | 2016 | 02 | 750 | 175 | 125 |
Jockey | 2015 | 12 | 450 | 125 | 45 |
Configuration is specified as follows
Input Schema
- Name, String
- Year, Int
- Month, Int
- Rent, Float
- Food, Float
- Transportation, Float
Normalize
- Record Field Mapping
- Input Field : Name, to : Name
- Input Field : Year, to : Year
- Input Field : Month, to : Month
- Record Field Normalizing
- Input Field : Rent, to : ExpenditureType
- Value to : Expenditure
- Input Field : Food, to : ExpenditureType
- Value to : Expenditure
- Input Field : Transportation, to: ExpenditureType
- Value to : Expenditure
- Input Field : Rent, to : ExpenditureType
Output Schema
- Name, String
- Year, Int
- Month, Int
- ExpenditureType, String
- Expenditure, Float
Assume we got following data from any relational store as source
Name | Year | Month | ExpenditureType | Expenditure |
Joltie | 2016 | 02 | Rent | 600 |
Joltie | 2016 | 02 | Food | 100 |
Joltie | 2016 | 02 | Transportation | 50 |
Root | 2016 | 02 | Rent | 900 |
Root | 2016 | 02 | Food | 200 |
Root | 2016 | 02 | Transportation | 135 |
Vikram | 2016 | 02 | Rent | 550 |
Vikram | 2016 | 02 | Food | 145 |
Vikram | 2016 | 02 | Transportation | 75 |
Jon | 2016 | 02 | Rent | 750 |
Jon | 2016 | 02 | Food | 175 |
Jon | 2016 | 02 | Transportation | 125 |
Jockey | 2015 | 12 | Rent | 450 |
Jockey | 2015 | 12 | Food | 125 |
Jockey | 2015 | 12 | Transportation | 45 |
Â
Design
- Output schema will be created on the basis of inputs to fieldMapping and fieldNormalizing fields, input json will look like as below:
{ "name": "Normalize", "plugin": { "name": "Normalize", "type": "transform", "label": "Normalize", "properties": { "fieldMapping": "Customer ID:ID,Purchase Date:Date", "fieldNormalizing": "Item ID:Attribute Type:Attribute Value,Item Cost:Attribute Type:Attribute Value" } } }
- fieldMapping and fieldNormalizing will be mandatory fields
- Type of all output schema fields will be STRING
- Type and Value mapping for all normalize fields must be same, otherwise it is invalid.
Example: Below mapping shows, Item ID and Item Cost mapped to normalize columns Attribute Type and Attribute Value."fieldNormalizing": "Item ID:Attribute Type:Attribute Value,Item Cost:Attribute Type:Attribute Value"
But below mapping is invalid as Item Cost mapped to different normalize columns.
"fieldNormalizing": "Item ID:Attribute Type:Attribute Value,Item Cost:Cost Type:Cost Value"
Table of Contents
Checklist
- User stories documentedÂ
- User stories reviewedÂ
- Design documentedÂ
- Design reviewedÂ
- Feature mergedÂ
- Examples and guidesÂ
- Integration testsÂ
- Documentation for featureÂ
- Short video demonstrating the feature