Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Overview
This addition will allow users to see the history of directives applied to a column of data.
Goals
Users should be able to see lineage information, i.e. directives, for columns.
Storing lineage information should have minimal or no impact on the wrangler application.
User Stories
As a user, I should be able to see the directives applied to a column of data.
As a user, I should be able to see the directives applied to a column of data over any period of time.
As a user, I should be able to add tags and properties to specific columns of data (stretch)
Design
Save directives for each column in AST format after parsing of directives along with necessary information (time, dataset/stream name/id, etc.).
Use TMS to send information to platform.
Unmarshal and store in HBase.
Access to lineage should only be available through the platform
Questions
- How to get source and sink datasets?
- How to ensure this works with multiple transform nodes, even just wrangler nodes?
- Does ParseTree have all necessary information for every directive?
Approach
Approach #1:
Store directives during execution of each step
Advantages:
- Fewer assumptions
Disadvantages:
- Requires adding a getter to each step class and, in some cases (~30%), a local variable
- Slower
Approach #2 (Preferred):
Compute lineage without looking at data by backtracking
Advantages:
- No instance variables added to step classes
- Faster
Disadvantages:
- Requires stricter rules on directives, e.g. every rename must give both the old and the new name. See * below for why
- Relies more on assumptions, e.g. parse-as-<> assumes that the output fields are derived from all the input fields
*Backtracking starts with columns A, B, C. The previous directive is "set-columns A B C". The directive before that is "lowercase <column>", where <column> is nameOfOwner. There is no way of knowing which of the final columns nameOfOwner refers to without looking at the data.
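The backtracking idea in Approach #2 can be sketched as follows. This is a minimal illustration, not the wrangler implementation: the `Touch` and `Step` classes, the `backtrack` method, and the sample directives are all hypothetical stand-ins for whatever the parse tree actually exposes. It assumes each step lists the columns it touches with a label, and that a rename carries both the old and the new name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types exist in wrangler.
final class BacktrackSketch {
  enum Label { READ, DROP, MODIFY, ADD, SWAP, RENAME }

  /** One column touched by a directive. For RENAME, newName is the new name; otherwise null. */
  static final class Touch {
    final String column;
    final Label label;
    final String newName;

    Touch(String column, Label label, String newName) {
      this.column = column;
      this.label = label;
      this.newName = newName;
    }
  }

  /** A parsed directive together with the columns it touches. */
  static final class Step {
    final String directive;
    final List<Touch> touches;

    Step(String directive, Touch... touches) {
      this.directive = directive;
      this.touches = Arrays.asList(touches);
    }
  }

  /** Walks the steps from last to first; returns, in execution order, the directives that shaped finalColumn. */
  static List<String> backtrack(String finalColumn, List<Step> steps) {
    List<String> history = new ArrayList<>();
    String name = finalColumn;
    for (int i = steps.size() - 1; i >= 0; i--) {
      Step step = steps.get(i);
      boolean created = false;
      for (Touch t : step.touches) {
        if (t.label == Label.RENAME && name.equals(t.newName)) {
          history.add(step.directive);
          name = t.column; // keep walking back under the old name
          break;
        }
        if (name.equals(t.column) && (t.label == Label.MODIFY || t.label == Label.ADD)) {
          history.add(step.directive);
          created = (t.label == Label.ADD);
          break;
        }
      }
      if (created) {
        break; // the column did not exist before this step
      }
    }
    Collections.reverse(history);
    return history;
  }

  /** Illustrative directive list; the directive strings are examples only. */
  static List<Step> exampleSteps() {
    return Arrays.asList(
        new Step("parse-as-csv body ,",
            new Touch("body", Label.READ, null),
            new Touch("body_1", Label.ADD, null)),
        new Step("rename body_1 name", new Touch("body_1", Label.RENAME, "name")),
        new Step("lowercase name", new Touch("name", Label.MODIFY, null)));
  }

  public static void main(String[] args) {
    System.out.println(backtrack("name", exampleSteps()));
  }
}
```

Without the old name on the rename step, the walk would lose track of the column at that point, which is the reason for the stricter rule on directives.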
API changes
New Programmatic APIs
FieldLevelLineage is a Java class that contains all the necessary information to be sent to the CDAP platform.
One instance should be created per wrangler node, passing in the list of final columns after the directives have executed.
store() takes a ParseTree and stores all the necessary information in the lineage.
Lineage for each column is stored in the lineage instance variable, which maps columns to ASTs.
The ParseTree should contain all columns affected by each directive.
Labels:
- All columns should be labeled one of: {Read, Drop, Modify, Add, Swap, Rename}
- Read: column's name or values are read but not changed
- Drop: column is dropped
- Add: column is added
- Swap: column's name is swapped with the name of another column
- Rename: column's name is replaced with another name
- Modify: column's values are altered in a way that doesn't fit any of the other categories, e.g. "lowercase"
For Read, Drop, Modify, and Add, the column and its label should look something like -> Column: Name, Label: add.
For Swap and Rename, the column and its label should look something like -> Column: body_5 DOB, Label: rename. Both names must be present, currently separated by a space: old/new for Rename, A/B for Swap.
For Read, Modify, and Add there is another option: instead of a column name, the step can return one of {"all columns", "all columns minus _ _ _ _ ", "all columns formatted %s_%d"} along with the label, e.g. Column: "all columns minus body", Label: add. "all columns" refers to all columns present in the dataset after execution of this step.
For Swap, Rename, and Drop this option is not available; the names of all columns involved must be returned explicitly.
**Assumption (can be changed): Consecutive Read columns impact every column that follows them until the next Read column. E.g. A: read, B: read, C: add, D: read, E: add. In this case A and B were read to create C, and D was read to create E, as in a merge of A and B followed by a copy of D.
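The Read assumption above can be made concrete with a small sketch. The class and method names here (`ReadGroupingSketch`, `Tagged`, `sourcesByColumn`) are hypothetical and not part of FieldLevelLineage. As a simplification, the sketch attributes reads only to Add and Modify columns and ignores Drop, Swap, and Rename.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "Read columns feed what follows" assumption.
final class ReadGroupingSketch {
  enum Label { READ, DROP, MODIFY, ADD, SWAP, RENAME }

  /** A column name paired with its label, in the order the step reports them. */
  static final class Tagged {
    final String column;
    final Label label;

    Tagged(String column, Label label) {
      this.column = column;
      this.label = label;
    }
  }

  /** Maps each added or modified column to the Read columns that precede it. */
  static Map<String, List<String>> sourcesByColumn(List<Tagged> entries) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    List<String> reads = new ArrayList<>();
    boolean consumed = false; // true once the current Read group has fed a column
    for (Tagged e : entries) {
      if (e.label == Label.READ) {
        if (consumed) {
          reads = new ArrayList<>(); // a Read after an Add/Modify starts a new group
          consumed = false;
        }
        reads.add(e.column);
      } else if (e.label == Label.ADD || e.label == Label.MODIFY) {
        result.put(e.column, new ArrayList<>(reads));
        consumed = true;
      }
    }
    return result;
  }

  /** The example from the text: A: read, B: read, C: add, D: read, E: add. */
  static List<Tagged> example() {
    return Arrays.asList(
        new Tagged("A", Label.READ),
        new Tagged("B", Label.READ),
        new Tagged("C", Label.ADD),
        new Tagged("D", Label.READ),
        new Tagged("E", Label.ADD));
  }

  public static void main(String[] args) {
    System.out.println(sourcesByColumn(example())); // prints {C=[A, B], E=[D]}
  }
}
```

If the assumption is changed, only the grouping rule inside the loop needs to change; the column and label pairs reported by each step stay the same.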
Algorithm visual: Example wrangler application --> lineage --> List for name column
New REST APIs
Path | Method | Description | Response |
---|---|---|---|
/v3/namespaces/{namespace-id}/datasets/{dataset-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels> | GET | Returns list of directives applied to the specified column in the specified dataset | 200: Successful Response TBD, but will contain a Tree representation |
/v3/namespaces/{namespace-id}/streams/{stream-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels> | GET | Returns list of directives applied to the specified column in the specified stream | 200: Successful Response TBD, but will contain a Tree representation |
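As an illustration of how a client might address the dataset endpoint, the sketch below only builds the request URI from the path template in the table. The class and method names are hypothetical, the base URL and identifiers are placeholders, the identifiers are assumed to be URL-safe, and the timestamp units are left to whatever the endpoint ends up expecting, since the response and parameter formats are still TBD.

```java
import java.net.URI;

// Hypothetical helper: builds the URI for the proposed dataset column lineage endpoint.
final class LineageUrlSketch {
  static URI datasetColumnLineage(String baseUrl, String namespaceId, String datasetId,
                                  String columnId, long startTs, long endTs, int maxLevels) {
    // Path follows the template in the table above; identifiers are assumed URL-safe.
    String path = "/v3/namespaces/" + namespaceId
        + "/datasets/" + datasetId
        + "/columns/" + columnId
        + "/lineage";
    String query = "start=" + startTs + "&end=" + endTs + "&maxLevels=" + maxLevels;
    return URI.create(baseUrl + path + "?" + query);
  }

  public static void main(String[] args) {
    // Placeholder host, namespace, dataset, and column values.
    System.out.println(datasetColumnLineage(
        "http://localhost:11015", "default", "customers", "name",
        1500000000L, 1500086400L, 3));
  }
}
```

The stream endpoint differs only in using streams/{stream-id} in place of datasets/{dataset-id}.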
CLI Impact or Changes
TBD
UI Impact or Changes
- Option 1: Add an interface to the metadata table shown when viewing a dataset, so that the lineage of a column can be seen, possibly by clicking on the column.
- Option 2: Show all columns at once directly on the lineage tab reached by clicking on a dataset, with tabs to switch between field level and dataset level.
Security Impact
Should be none, TBD
Impact on Infrastructure Outages
Storage in HBase; Impact TBD.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
1 | Tests all directives | All Step subclasses should be properly parsed containing all correct columns with correct labels |
2 | Multiple datasets/streams | Lineages are correctly shown between different datasets/streams |
3 | Tests all store() | FieldLevelLineage.store() always correctly stores step |
Releases
Release 4.3.0
Release 4.4.0
Related Work
- Fixing TextDirectives and parsing of directives in general