Normalize

Introduction

Users sometimes need to transform one source row into multiple target rows. For example, attributes stored as columns in a table or file may need to be broken into multiple records, one record per column attribute. In general, Normalize converts columns to rows.

Use-Case

  • Convert wide rows into a reduced, canonical form.
  • Reduce the need to restructure the collection as new types of data are added.
  • For example, assume we are building a Customer 360 Master table that aggregates data for a user from multiple sources, each of which has its own type of data to be added for a customer id. Instead of creating wide columns, normalization transforms the data into its canonical form, so that the customer 360 profile can be updated simultaneously from multiple sources.

User Stories

  • Users should be able to specify a record field mapping. The record field mapping simply maps a field in the input schema to a field in the output schema. These fields will be present in every output record.
  • Users should be able to specify a normalize field mapping. This describes an input field name, the output field its name should be mapped to, and the output field its value should be added to.

Example

Example 1

Customer 360 Use-case

Assume we have a source ‘Customer Profile’ table and a ‘Customer Purchase’ table.

Customer Profile Table

Customer Id | First Name | Last Name | Shipping Address | Credit Card | Billing Address | Last Update Date
S23424242 | Joltie | Root | 32826 Mars Way, Marsville, MR, 24344 | 2334-232132-2323 | 32826 Mars Way, Marsville, MR, 24344 | 05/12/2015
R45764646 | Root | Joltie | 32423, Your Way, YourVille, YR, 65765 | 2343-12312-12313 | 32421, MyVilla Ct, YourVille, YR, 23423 | 04/03/2012

Build a pipeline that ingests the above data into a Customer 360 Master table with the following schema and configuration.

Input Schema
  • Customer ID
  • First Name
  • Last Name
  • Shipping Address
  • Credit Card
  • Billing Address
  • Last Update Date

Normalize
  • Record Field Mapping
    • Input Field : Customer ID, to : ID
    • Input Field : Last Update Date, to : Date
  • Normalize Field Mapping
    • Input Field : First Name, to : Attribute Type
      • Value to : Attribute Value
    • Input Field : Last Name, to : Attribute Type
      • Value to : Attribute Value
    • Input Field : Credit Card, to : Attribute Type
      • Value to : Attribute Value
    • Input Field : Billing Address, to : Attribute Type
      • Value to : Attribute Value

Output Schema
  • ID
  • Attribute Type
  • Attribute Value
  • Date

Following is the output from running the pipeline with the configuration specified above. Let’s call this the ‘Master Customer 360 Table’.

ID | Attribute Type | Attribute Value | Date
S23424242 | First Name | Joltie | 05/12/2015
S23424242 | Last Name | Root | 05/12/2015
S23424242 | Credit Card | 2334-232132-2323 | 05/12/2015
S23424242 | Billing Address | 32826 Mars Way, Marsville, MR, 24344 | 05/12/2015
R45764646 | First Name | Root | 04/03/2012
R45764646 | Last Name | Joltie | 04/03/2012
R45764646 | Credit Card | 2343-12312-12313 | 04/03/2012
R45764646 | Billing Address | 32421, MyVilla Ct, YourVille, YR, 23423 | 04/03/2012
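
The expansion above is mechanical enough that a short sketch may help. Below is a minimal, illustrative Java sketch of the normalize step, assuming records are simple String-to-String maps; the class and method names are ours, not the actual plugin code.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class NormalizeSketch {
      // Emits one output record per normalize field. fieldMapping maps an input
      // field to the output field it is copied to on every record (e.g.
      // Customer ID -> ID). fieldNormalizing maps an input field to the pair of
      // output columns {type column, value column} that receive its name and value.
      public static List<Map<String, String>> normalize(Map<String, String> input,
                                                        Map<String, String> fieldMapping,
                                                        Map<String, String[]> fieldNormalizing) {
        List<Map<String, String>> output = new ArrayList<>();
        for (Map.Entry<String, String[]> e : fieldNormalizing.entrySet()) {
          Map<String, String> record = new LinkedHashMap<>();
          // The mapped fields appear in every output record.
          for (Map.Entry<String, String> m : fieldMapping.entrySet()) {
            record.put(m.getValue(), input.get(m.getKey()));
          }
          // The normalized field contributes its name to the type column
          // (e.g. Attribute Type = "First Name") and its value to the value
          // column (e.g. Attribute Value = "Joltie").
          record.put(e.getValue()[0], e.getKey());
          record.put(e.getValue()[1], input.get(e.getKey()));
          output.add(record);
        }
        return output;
      }
    }

Note that input fields listed in neither mapping, such as Shipping Address here, are simply dropped from the output.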

Customer Purchase Table

Customer Id | Item ID | Item Cost | Purchase Date
S23424242 | UR-AR-243123-ST | 245.67 | 08/09/2015
S23424242 | SKU-234294242942 | 67.90 | 10/12/2015
R45764646 | SKU-567757543532 | 14.15 | 06/09/2014

Now, let’s assume the user configures the pipeline for the above data to be added to the ‘Master Customer 360 Table’.

Following are the schema and configuration:

Input Schema
  • Customer Id
  • Item ID
  • Item Cost
  • Purchase Date
Normalize
  • Record Field Mapping
    • Input Field : Customer Id, to : ID
    • Input Field : Purchase Date, to : Date
  • Normalize Field Mapping
    • Input Field : Item ID, to : Attribute Type
      • Value to : Attribute Value
    • Input Field : Item Cost, to : Attribute Type
      • Value to : Attribute Value
Output Schema
  • ID
  • Attribute Type
  • Attribute Value
  • Date

The Master Customer 360 Table would be updated as follows:

ID | Attribute Type | Attribute Value | Date
S23424242 | First Name | Joltie | 05/12/2015
S23424242 | Last Name | Root | 05/12/2015
S23424242 | Credit Card | 2334-232132-2323 | 05/12/2015
S23424242 | Billing Address | 32826 Mars Way, Marsville, MR, 24344 | 05/12/2015
R45764646 | First Name | Root | 04/03/2012
R45764646 | Last Name | Joltie | 04/03/2012
R45764646 | Credit Card | 2343-12312-12313 | 04/03/2012
R45764646 | Billing Address | 32421, MyVilla Ct, YourVille, YR, 23423 | 04/03/2012
S23424242 | Item ID | UR-AR-243123-ST | 08/09/2015
S23424242 | Item Cost | 245.67 | 08/09/2015
S23424242 | Item ID | SKU-234294242942 | 10/12/2015
S23424242 | Item Cost | 67.90 | 10/12/2015
R45764646 | Item ID | SKU-567757543532 | 06/09/2014
R45764646 | Item Cost | 14.15 | 06/09/2014

Example 2
Assume we got the following data from a relational store as the source:

Name | Year | Month | Rent | Food | Transportation
Joltie | 2016 | 02 | 600 | 100 | 50
Root | 2016 | 02 | 900 | 200 | 135
Vikram | 2016 | 02 | 550 | 145 | 75
Jon | 2016 | 02 | 750 | 175 | 125
Jockey | 2015 | 12 | 450 | 125 | 45

The configuration is specified as follows:

Input Schema
  • Name, String
  • Year, Int
  • Month, Int
  • Rent, Float
  • Food, Float
  • Transportation, Float
Normalize
  • Record Field Mapping
    • Input Field : Name, to : Name
    • Input Field : Year, to : Year
    • Input Field : Month, to : Month
  • Normalize Field Mapping
    • Input Field : Rent, to : ExpenditureType
      • Value to : Expenditure
    • Input Field : Food, to : ExpenditureType
      • Value to : Expenditure
    • Input Field : Transportation, to: ExpenditureType
      • Value to : Expenditure
Output Schema
  • Name, String
  • Year, Int
  • Month, Int
  • ExpenditureType, String
  • Expenditure, Float

Following is the output from running the pipeline with the configuration specified above:

Name | Year | Month | ExpenditureType | Expenditure
Joltie | 2016 | 02 | Rent | 600
Joltie | 2016 | 02 | Food | 100
Joltie | 2016 | 02 | Transportation | 50
Root | 2016 | 02 | Rent | 900
Root | 2016 | 02 | Food | 200
Root | 2016 | 02 | Transportation | 135
Vikram | 2016 | 02 | Rent | 550
Vikram | 2016 | 02 | Food | 145
Vikram | 2016 | 02 | Transportation | 75
Jon | 2016 | 02 | Rent | 750
Jon | 2016 | 02 | Food | 175
Jon | 2016 | 02 | Transportation | 125
Jockey | 2015 | 12 | Rent | 450
Jockey | 2015 | 12 | Food | 125
Jockey | 2015 | 12 | Transportation | 45

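Reusing the NormalizeSketch class from Example 1 (again, an illustrative sketch rather than the plugin itself), Joltie’s row expands into the first three output records above:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class Example2Demo {
      public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<>();
        input.put("Name", "Joltie");
        input.put("Year", "2016");
        input.put("Month", "02");
        input.put("Rent", "600");
        input.put("Food", "100");
        input.put("Transportation", "50");

        Map<String, String> fieldMapping = new LinkedHashMap<>();
        fieldMapping.put("Name", "Name");
        fieldMapping.put("Year", "Year");
        fieldMapping.put("Month", "Month");

        Map<String, String[]> fieldNormalizing = new LinkedHashMap<>();
        fieldNormalizing.put("Rent", new String[] {"ExpenditureType", "Expenditure"});
        fieldNormalizing.put("Food", new String[] {"ExpenditureType", "Expenditure"});
        fieldNormalizing.put("Transportation", new String[] {"ExpenditureType", "Expenditure"});

        // Prints {Name=Joltie, Year=2016, Month=02, ExpenditureType=Rent,
        // Expenditure=600}, then the Food and Transportation records.
        for (Map<String, String> record : NormalizeSketch.normalize(input, fieldMapping, fieldNormalizing)) {
          System.out.println(record);
        }
      }
    }
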
Design

  • The output schema will be created on the basis of the inputs to the fieldMapping and fieldNormalizing properties. The input JSON will look like the following (a sketch of how these properties could be parsed appears after this list):
     {
        "name": "Normalize",
        "plugin": {
           "name": "Normalize",
           "type": "transform",
           "label": "Normalize",
           "properties": {
              "fieldMapping": "Customer ID:ID,Purchase Date:Date",
              "fieldNormalizing": "Item ID:Attribute Type:Attribute Value,Item Cost:Attribute Type:Attribute Value"
           }
        }
     }
  • fieldMapping and fieldNormalizing will be mandatory fields.
  • The type of all output schema fields will be STRING.
  • The Type and Value mapping for all normalize fields must be the same; otherwise the configuration is invalid (the sketch after this list also shows this check).
    Example: The mapping below maps Item ID and Item Cost to the normalize columns Attribute Type and Attribute Value.
    "fieldNormalizing": "Item ID:Attribute Type:Attribute Value,Item Cost:Attribute Type:Attribute Value"

    But the mapping below is invalid, as Item Cost is mapped to different normalize columns.

    "fieldNormalizing": "Item ID:Attribute Type:Attribute Value,Item Cost:Cost Type:Cost Value"

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature