Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Overview 

Note: this section will be updated with the new parse changes.

This addition will allow users to see the history of directives applied to a column of data.

Goals

Users should be able to see lineage information, i.e., directives, for columns.

Storing lineage information should have minimal or no impact on the Wrangler application.

User Stories

  • As a user, I should be able to see the directives applied to a column of data.

  • As a user, I should be able to see the directives applied to a column of data over any period of time.

  • As a user, I should be able to add tags and properties to specific columns of data (stretch)

Design

During execution, after the Wrangler directives are parsed, save the directives for each column in AST format, along with the necessary information (time, dataset/stream name/id).

Use TMS to send the information to the platform.

Unmarshal the information and store it in HBase.

Access to lineage should be available only through the platform.

Questions

  • How to get source and sink datasets?
  • How to handle the set columns step?
  • Does ParseTree have all necessary information for every directive?

Approach

Approach #1:

Store directives during execution of each step

Advantages:

  • DirectivesLineage class will not need to be modified with new directives
  • Would integrate well with user-defined directives and other applications in the future
  • Fewer assumptions

Disadvantages:

  • Requires adding a getter to each step class, and sometimes (~30% of cases) a local variable
  • Slower

Approach #2 (Preferred):

Compute lineage without looking at data by backtracking

Advantages:

  • No instance variables added to step classes
  • Faster

Disadvantages:

  • Breaks with some directives and requires stricter rules on directives, i.e., every rename must give both the old and the new name. See * below for why.
  • More assumption-based, e.g., parse-as-<> assumes that the output fields come from all the input fields

*Backtrack starting with columns A, B, C. The previous directive is "set-columns A B C". The directive before that is "filter-row-if-matched <column> <regex>", where <column> is nameOfOwner. There is no way of knowing what nameOfOwner refers to without looking at the data.
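
The backtracking idea in Approach #2 can be sketched roughly as follows. This is only an illustration under assumptions, not the planned implementation: the Directive and ColumnRef classes are hypothetical stand-ins for the parsed directives, the labels follow the {Read, Drop, Modify, Add, Swap, Rename} set proposed in this document, and the sample directive strings are examples only. The sketch walks the directives from last to first, follows each final column through renames and swaps, and stops at the directive that added the column. It does not branch into the columns that an added column was derived from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BacktrackLineageSketch {
  enum Label { READ, DROP, MODIFY, ADD, SWAP, RENAME }

  /** A labeled column; 'other' is the new name for RENAME, the partner for SWAP, else null. */
  static final class ColumnRef {
    final Label label;
    final String name;
    final String other;

    ColumnRef(Label label, String name, String other) {
      this.label = label;
      this.name = name;
      this.other = other;
    }
  }

  /** Hypothetical stand-in for one parsed directive and its labeled columns. */
  static final class Directive {
    final String text;
    final List<ColumnRef> columns;

    Directive(String text, ColumnRef... columns) {
      this.text = text;
      this.columns = Arrays.asList(columns);
    }
  }

  /** Walks directives last-to-first; returns, per final column, the directives that shaped it. */
  static Map<String, List<String>> backtrack(List<String> finalColumns, List<Directive> directives) {
    Map<String, List<String>> lineage = new LinkedHashMap<>();
    for (String finalName : finalColumns) {
      List<String> history = new ArrayList<>();
      String current = finalName; // name the column carries at the current point of the walk
      boolean exists = true;      // becomes false once the directive that added the column is passed
      for (int i = directives.size() - 1; i >= 0 && exists; i--) {
        Directive d = directives.get(i);
        for (ColumnRef ref : d.columns) {
          boolean hit = false;
          switch (ref.label) {
            case RENAME: // 'name' is the old name, 'other' the new one
              if (current.equals(ref.other)) { current = ref.name; hit = true; }
              break;
            case SWAP:
              if (current.equals(ref.name)) { current = ref.other; hit = true; }
              else if (current.equals(ref.other)) { current = ref.name; hit = true; }
              break;
            case ADD:
              if (current.equals(ref.name)) { exists = false; hit = true; }
              break;
            case MODIFY:
              hit = current.equals(ref.name);
              break;
            default:
              break; // READ and DROP are ignored in this sketch
          }
          if (hit) { history.add(d.text); break; }
        }
      }
      Collections.reverse(history); // report in execution order
      lineage.put(finalName, history);
    }
    return lineage;
  }

  /** Illustrative input only; the column labels are assigned by hand here. */
  static List<Directive> sampleDirectives() {
    return Arrays.asList(
        new Directive("parse-as-csv body ,",
            new ColumnRef(Label.READ, "body", null),
            new ColumnRef(Label.ADD, "body_1", null),
            new ColumnRef(Label.ADD, "body_2", null)),
        new Directive("rename body_1 name", new ColumnRef(Label.RENAME, "body_1", "name")),
        new Directive("uppercase name", new ColumnRef(Label.MODIFY, "name", null)),
        new Directive("rename body_2 dob", new ColumnRef(Label.RENAME, "body_2", "dob")));
  }

  public static void main(String[] args) {
    System.out.println(backtrack(Arrays.asList("name", "dob"), sampleDirectives()));
  }
}
```

Because every rename in the sample carries both the old and the new name, the walk never needs to look at the data. A directive such as "set-columns A B C" provides no such pairing, which is why it breaks this approach.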

API changes

New Programmatic APIs

FieldLevelLineage: Java class that contains all the information to be sent to the CDAP platform.

An instance should be initialized by passing in the list of final columns after wrangling.

store() takes a ParseTree and stores all the necessary information into lineage.

The lineage for each column is stored in the lineage instance variable, which is a map to ASTs.

public class FieldLevelLineage {
  private class BranchingStepNode {
    boolean continueDown;
    Step directive;
    Map<String, Integer> branches;
    // constructors, toString()
  }

  private String dataSetName; // dataset/stream name/id
  private final long startTime;
  private final String programName;
  private final String[] finalColumns;
  private final Set<String> currentColumns; // not sure if needed
  private final Map<String, List<BranchingStepNode>> lineage; // main storage

  public FieldLevelLineage(String dataSetName, String[] columnNames) {...}

  // getters for startTime, programName, dataSetName, finalColumns
  // setter for dataSetName

  // Helpers for store

  public void store(ParseTree tree) {
    List<String> readCols, addCols, modifyCols, dropCols, renameCols, swapCols;
    /*
     * Go through the tree one directive at a time.
     * For each column associated with the directive, put the name of the column
     * into the matching list based on its label.
     * From these 6 lists, store correctly into lineage.
     */
  }
}

Works for every directive except {ChangeColCaseNames, CleanseColumnNames, Columns, ColumnsReplace, Keep}. In short, it does not work well with directives that do not say which columns are being renamed or dropped. It is impossible with Columns (Set Columns).

*Should consider enforcing a rule that directives must explicitly name the columns being dropped or renamed*

Parse Tree: All columns should be labeled one of {Read, Drop, Modify, Add, Swap, Rename}:

  • Read: column's name or values are read but not changed.
  • Drop: column is dropped.
  • Modify: column's values are altered.
  • Add: column is added.
  • Swap: column's name is swapped with the name of another column.
  • Rename: column's name is changed.

For Read, Drop, Modify, and Add, an entry would look something like: Column: Name, Label: add.

For Swap and Rename, an entry would look something like: Column: body_5 DOB, Label: rename. The entry needs some way of carrying both names: old/new for rename, A/B for swap.
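
One possible shape for these labeled entries is sketched below, as an assumption rather than the actual ParseTree design. LabeledColumn and bucket() are hypothetical names. The sketch assumes column names contain no whitespace, so that a rename or swap entry can carry both names in one string (as in "body_5 DOB"), and it rejects entries that carry the wrong number of names. The bucket() helper mirrors the first step described for store(): sorting column entries into six lists by label.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class ColumnLabelSketch {
  enum Label { READ, DROP, MODIFY, ADD, SWAP, RENAME }

  /** Hypothetical parse-tree entry: a column string plus its label. */
  static final class LabeledColumn {
    final String column; // one name, or two space-separated names for SWAP/RENAME
    final Label label;

    LabeledColumn(String column, Label label) {
      String trimmed = column.trim();
      int expected = (label == Label.SWAP || label == Label.RENAME) ? 2 : 1;
      if (trimmed.isEmpty() || trimmed.split("\\s+").length != expected) {
        throw new IllegalArgumentException(
            label + " expects " + expected + " column name(s) but got: '" + column + "'");
      }
      this.column = trimmed;
      this.label = label;
    }

    /** Returns {old, new} for RENAME, {A, B} for SWAP, or a single name otherwise. */
    String[] names() {
      return column.split("\\s+");
    }
  }

  /** Sorts entries into one list per label; every label gets a (possibly empty) list. */
  static Map<Label, List<String>> bucket(List<LabeledColumn> columns) {
    Map<Label, List<String>> buckets = new EnumMap<>(Label.class);
    for (Label label : Label.values()) {
      buckets.put(label, new ArrayList<String>());
    }
    for (LabeledColumn c : columns) {
      buckets.get(c.label).add(c.column);
    }
    return buckets;
  }
}
```

Validating the name count at construction time is one way to enforce the stricter rule discussed above, since a rename without both names cannot be represented at all.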

New REST APIs

Path: /v3/namespaces/{namespace-id}/datasets/{dataset-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels>
Method: GET
Description: Returns list of directives applied to the specified column in the specified dataset
Response: 200: Successful. Response TBD, but will contain a Tree representation

Path: /v3/namespaces/{namespace-id}/streams/{stream-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels>
Method: GET
Description: Returns list of directives applied to the specified column in the specified stream
Response: 200: Successful. Response TBD, but will contain a Tree representation
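
As a rough illustration of how a client might assemble these paths, the sketch below builds the request path from its parts. The class and method names are hypothetical, and the timestamp unit for start and end is an assumption left open here because the proposal does not specify it. Only the path layout and the query parameter names come from the endpoints listed above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class LineageUrlSketch {
  /** Percent-encodes one path segment; URLEncoder emits '+' for spaces, so that is rewritten. */
  static String segment(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  /**
   * Builds the column lineage path.
   * entityType must be "datasets" or "streams", matching the two proposed endpoints.
   */
  static String lineagePath(String namespace, String entityType, String entityId,
                            String column, long startTs, long endTs, int maxLevels) {
    if (!entityType.equals("datasets") && !entityType.equals("streams")) {
      throw new IllegalArgumentException("entityType must be 'datasets' or 'streams'");
    }
    return "/v3/namespaces/" + segment(namespace)
        + "/" + entityType + "/" + segment(entityId)
        + "/columns/" + segment(column)
        + "/lineage?start=" + startTs
        + "&end=" + endTs
        + "&maxLevels=" + maxLevels;
  }

  public static void main(String[] args) {
    System.out.println(lineagePath("default", "datasets", "customers", "DOB", 100L, 200L, 5));
  }
}
```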

CLI Impact or Changes

TBD

UI Impact or Changes

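  • Add an interface to the metadata table shown when viewing a dataset, so that users can see the lineage of a column, possibly by clicking on the column. [Figure: mockup of the metadata table]
  • When a column is clicked, a lineage view for that column is shown. [Figure: mockup of the column lineage view]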

Security Impact 

Should be none, TBD

Impact on Infrastructure Outages 

Storage in HBase; Impact TBD.

Test Scenarios

Test ID: 1
Test Description: Tests all directives
Expected Results: All Step subclasses should be properly parsed, containing all correct columns with correct labels

Test ID: 2
Test Description: Multiple datasets/streams
Expected Results: Lineages are correctly shown between different datasets/streams

Test ID: 3
Test Description: Tests all store()
Expected Results: FieldLevelLineage.store() always correctly stores step

Releases

Release 4.3.0

Release 4.4.0

Related Work

  • Fixing TextDirectives and parsing of directives in general