Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Overview
Note: this section will be updated to reflect the new parse changes.
This addition will allow users to see the history of directives applied to a column of data.
Goals
Users should be able to see lineage information, i.e., directives, for columns
Storing lineage information should have minimal or no impact on the Wrangler application
User Stories
As a user, I should be able to see the directives applied to a column of data.
As a user, I should be able to see the directives applied to a column of data over any period of time.
As a user, I should be able to add tags and properties to specific columns of data (stretch)
Design
Save directives for each column in AST format during execution of wrangler along with necessary information (time, dataset/stream name/id).
Use TMS to send information to platform.
Unmarshal and store in HBase.
Access to lineage should only be available through the platform
Questions
- How to get source and sink datasets?
- How to handle set columns step?
Approach
Approach #1:
Store directives during execution of each step
Advantages:
- No switch statement; the DirectivesLineage class will not need to be modified when new directives are added
- Would integrate well with user-defined directives and other applications in the future
Disadvantages:
- Requires adding a getter to each step class and, in roughly 30% of cases, a local variable
- Slower
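A minimal sketch of Approach #1, assuming each step exposes the columns it touched through a new getter. LineageStep, SimpleStep, ExecutionTimeLineage, and touchedColumns() are hypothetical names used only for illustration; they are not the real Wrangler Step interface. The sketch does not merge history across renames.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Wrangler step (assumption, not the real interface).
interface LineageStep {
  String directive();             // text of the directive, e.g. "uppercase name"
  List<String> touchedColumns();  // the getter Approach #1 would add to each step
}

// Trivial step implementation used only to exercise the recorder.
final class SimpleStep implements LineageStep {
  private final String directive;
  private final List<String> columns;

  SimpleStep(String directive, List<String> columns) {
    this.directive = directive;
    this.columns = columns;
  }

  @Override public String directive() { return directive; }
  @Override public List<String> touchedColumns() { return columns; }
}

// Records lineage while steps execute; no switch over directive types is needed.
final class ExecutionTimeLineage {
  private final Map<String, List<String>> history = new LinkedHashMap<>();

  // Called once per step, right after the step has executed.
  void record(LineageStep step) {
    for (String column : step.touchedColumns()) {
      history.computeIfAbsent(column, k -> new ArrayList<>()).add(step.directive());
    }
  }

  // Directives recorded for a column, in execution order; empty if none.
  List<String> directivesFor(String column) {
    List<String> found = history.get(column);
    return found == null ? Collections.<String>emptyList() : found;
  }
}
```

Because each step reports its own columns, a user-defined directive would only need to implement the getter, which is the integration advantage listed above.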
Approach #2 (Preferred):
Compute lineage without looking at data by backtracking
Advantages:
- Wouldn't need to modify any step functions
- No instance variables added to step classes
- Faster
Disadvantages:
- Will require a large switch or if/else statement
- Breaks with some directives (see * below); requires stricter rules on directives, i.e., every rename must give the old and new name
- More assumption-based, e.g., parse-as-<> assumes that the output fields are derived from all the input fields
- Any new directive would require a change to the DirectivesLineage class
- Wouldn't integrate well with user-defined directives
*Backtrack starting with columns A,B,C. Previous directive is "set-columns A B C". The directive before that is "filter-row-if-matched <column> <regex>" where <column> is nameOfOwner. No way of knowing what nameOfOwner refers to without looking at data.
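The backtracking idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: BacktrackLineage and Result are hypothetical names, directives are plain whitespace-separated strings, and only a handful of directives with simplified syntax are handled. It walks the recipe in reverse from the final columns, and stops at set-columns because the earlier column names cannot be recovered without looking at the data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BacktrackLineage {
  // Lineage per final column (oldest directive first) plus a completeness flag.
  static final class Result {
    final Map<String, List<String>> lineage = new LinkedHashMap<>();
    boolean complete = true;
  }

  static Result compute(List<String> directives, List<String> finalColumns) {
    Result result = new Result();
    // Column name at the point in the recipe being examined -> final column name.
    Map<String, String> tracked = new LinkedHashMap<>();
    for (String column : finalColumns) {
      result.lineage.put(column, new ArrayList<>());
      tracked.put(column, column);
    }

    walk:
    for (int i = directives.size() - 1; i >= 0; i--) {
      String directive = directives.get(i);
      String[] t = directive.trim().split("\\s+");
      switch (t[0]) {
        case "rename": // assumed syntax: rename <old> <new>
          if (t.length == 3 && tracked.containsKey(t[2])) {
            String finalName = tracked.remove(t[2]);
            result.lineage.get(finalName).add(0, directive);
            tracked.put(t[1], finalName); // keep following the column under its old name
          }
          break;
        case "uppercase":
        case "lowercase": // assumed syntax: <directive> <column>
          if (t.length == 2 && tracked.containsKey(t[1])) {
            result.lineage.get(tracked.get(t[1])).add(0, directive);
          }
          break;
        case "set-columns": // earlier names are unknown without the data
          for (int j = 1; j < t.length; j++) {
            if (tracked.containsKey(t[j])) {
              result.lineage.get(tracked.get(t[j])).add(0, directive);
            }
          }
          result.complete = false;
          break walk;
        default: // every new directive needs a new case here
          throw new IllegalArgumentException("No lineage rule for directive: " + directive);
      }
    }
    return result;
  }
}
```

The switch and the default branch make the listed disadvantages concrete: each new directive needs its own case, and a set-columns step leaves the result marked incomplete.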
API changes
New Programmatic APIs
DirectivesLineage Java class that contains all necessary information to be sent to CDAP platform
An instance should be initialized at the Wrangler step; it should be parsed along with the execute function, passing in the list of final columns after wrangling.
store() takes one step at a time, or could be trivially changed to take a list of steps.
Stores the lineage for each column in the lineage instance variable, which is a map from column names to ASTs.
```java
public class DirectivesLineage {
  private String dataSetName; // dataset/stream name/id
  private final long startTime;
  private final String programName;
  private final String[] finalNames;
  private final Set<String> currentColumns;
  private final Map<String, List<MetaStep>> lineage;

  public DirectivesLineage(String dataSetName, String[] columnNames) {...}

  // getters for startTime, programName, dataSetName
  // setter for dataSetName

  private List<String> getSwappedCols(List<String> in, List<String> out) {...}
  private List<String> getRenamedCols(List<String> in, List<String> out) {...}
  private List<String> getReadCols(List<String> in, List<String> out) {...}
  private List<String> getAddedCols(List<String> out) {...}
  private List<String> getDroppedCols(List<String> in) {...}
  // More helpers for parse

  public void store(Step currStep) {
    /*
     * Get input columns and output columns from currStep function
     * Use get.*() functions to get 5 lists. // given by new way of parsing directives
     * Comprehensive solution no case statement
     * From these 5 lists store correctly into lineage
     */
  }

  private class MetaStep {
    Step directive;
    List<String> columnName;
    List<Integer> skipSteps;

    // constructor

    void put(String colName, int skips) {...}
  }
}
```
Works for every directive except {ChangeColCaseNames, CleanseColumnNames, Columns, ColumnsReplace, Keep}; in general, it does not work well with directives that do not state which columns are being renamed or dropped. It is impossible with Columns or Set Columns.
*Consider enforcing a rule that directives must explicitly name the columns being dropped or renamed.*
New REST APIs
Path | Method | Description | Response |
---|---|---|---|
/v3/namespaces/{namespace-id}/datasets/{dataset-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels> | GET | Returns list of directives applied to the specified column in the specified dataset | 200: Successful Response TBD, but will contain a Tree representation |
/v3/namespaces/{namespace-id}/streams/{stream-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels> | GET | Returns list of directives applied to the specified column in the specified stream | 200: Successful Response TBD, but will contain a Tree representation |
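As an illustration of how a client might address these endpoints, the sketch below only assembles the request URL from the path templates in the table. LineageUrls and columnLineage are hypothetical helpers, the base URL in the usage note is an assumed local router address, and the timestamp units are not specified by this document. Identifiers are not URL-encoded here, and the response format is still TBD, so no parsing is shown.

```java
final class LineageUrls {
  // Builds the column-lineage URL for a dataset or a stream.
  // entityType must be "datasets" or "streams", matching the two paths in the table.
  static String columnLineage(String baseUrl, String namespaceId, String entityType,
                              String entityId, String columnId,
                              long startTs, long endTs, int maxLevels) {
    if (!entityType.equals("datasets") && !entityType.equals("streams")) {
      throw new IllegalArgumentException("entityType must be datasets or streams: " + entityType);
    }
    return baseUrl
        + "/v3/namespaces/" + namespaceId
        + "/" + entityType + "/" + entityId
        + "/columns/" + columnId
        + "/lineage?start=" + startTs
        + "&end=" + endTs
        + "&maxLevels=" + maxLevels;
  }
}
```

For example, columnLineage("http://localhost:11015", "default", "datasets", "customers", "owner", 1500000000L, 1500003600L, 3) yields a URL that can be issued as a GET request with any HTTP client.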
CLI Impact or Changes
TBD
UI Impact or Changes
- Add interface to metadata table when viewing dataset to see lineage of columns possibly by clicking on column:
- When a column is clicked on will look something like:
Security Impact
Should be none, TBD
Impact on Infrastructure Outages
Storage in HBase; Impact TBD.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
1 | Tests all getColumns() | All Step subclasses always properly return correct columns |
2 | Multiple datasets/streams | Lineages are correctly shown between different datasets/streams |
3 | Tests all parse() | DirectivesLineage.parse() always correctly parses step |
Releases
Release 4.3.0
Release 4.4.0
Related Work
- Fixing TextDirectives and parsing of directives in general