Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Overview
This addition will allow users to see the history of directives applied to a column of data.
Goals
Users should be able to see lineage information for a column, i.e., the directives applied to that column.
Storing lineage information should have minimal or no impact on the Wrangler application.
User Stories
As a user, I should be able to see the directives applied to a column of data.
As a user, I should be able to see the directives applied to a column of data over any period of time.
As a user, I should be able to add tags and properties to specific columns of data (stretch goal).
Design
Save directives for each column in AST format during execution of wrangler along with necessary information (time, dataset/stream name/id).
Use TMS to send information to platform.
Unmarshal and store in HBase.
Access to lineage should be available only through the platform.
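The marshal and unmarshal steps above can be sketched as a simple round trip. This is only an illustration under stated assumptions: `LineageRecord` and its fields are hypothetical, the directive is carried as plain text rather than an AST, and the byte layout is made up for this example. The actual TMS publish call and HBase write are deliberately left out because this document does not specify them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record: one directive applied to one column at a given time.
final class LineageRecord {
  final String dataSetName;   // dataset or stream name/id
  final String columnName;
  final long timestampMillis; // when the directive was executed
  final String directive;     // plain text here; the design stores an AST instead

  LineageRecord(String dataSetName, String columnName, long timestampMillis, String directive) {
    this.dataSetName = dataSetName;
    this.columnName = columnName;
    this.timestampMillis = timestampMillis;
    this.directive = directive;
  }

  // Marshal the record into a byte payload that could be used as a message body.
  byte[] toBytes() {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeUTF(dataSetName);
      out.writeUTF(columnName);
      out.writeLong(timestampMillis);
      out.writeUTF(directive);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  // Unmarshal a payload produced by toBytes(), reading fields in the same order.
  static LineageRecord fromBytes(byte[] payload) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
      String dataSet = in.readUTF();
      String column = in.readUTF();
      long timestamp = in.readLong();
      String directive = in.readUTF();
      return new LineageRecord(dataSet, column, timestamp, directive);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

On the platform side, a consumer would call `LineageRecord.fromBytes(payload)` on each received message before writing the fields to storage.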
Approach
Approach #1 (Preferred):
Store directives in DirectivesLineage during execution time
Advantages: No code has to run twice. Error checking is already handled by execution.
Disadvantages: Requires changes to all subclasses of AbstractStep. Needs two switch statements (the new one is quite small).
Approach #2:
Store directives in DirectivesLineage during parse time.
Advantages: Code changes restricted to ~2 files. Only one switch statement.
Disadvantages: Error checking is harder. Major drawback: execution-time code would have to run twice for certain directives.
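A minimal sketch of Approach #1, to show why error checking comes for free when recording happens at execution time. All names here (`SketchStep`, `ExecutionTimeRecorder`, `UppercaseStep`) are hypothetical stand-ins, not the real AbstractStep API, and the sketch records once per call, whereas a real implementation would need to avoid recording the same directive for every row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Wrangler step; not the real AbstractStep API.
interface SketchStep {
  String name();                          // directive text, used as the lineage entry
  List<String> getColumns();              // columns the step touches
  void execute(Map<String, Object> row);  // throws on invalid input
}

// Records a step against its columns only after the step has executed successfully.
final class ExecutionTimeRecorder {
  private final Map<String, List<String>> lineage = new LinkedHashMap<>();

  void run(SketchStep step, Map<String, Object> row) {
    step.execute(row); // if this throws, nothing below runs and nothing is recorded
    for (String column : step.getColumns()) {
      lineage.computeIfAbsent(column, k -> new ArrayList<>()).add(step.name());
    }
  }

  List<String> directivesFor(String column) {
    return lineage.getOrDefault(column, Collections.emptyList());
  }
}

// Example step: upper-cases a string column and rejects anything else.
final class UppercaseStep implements SketchStep {
  private final String column;

  UppercaseStep(String column) {
    this.column = column;
  }

  public String name() {
    return "uppercase " + column;
  }

  public List<String> getColumns() {
    return Collections.singletonList(column);
  }

  public void execute(Map<String, Object> row) {
    Object value = row.get(column);
    if (!(value instanceof String)) {
      throw new IllegalArgumentException("Column '" + column + "' is missing or not a string");
    }
    row.put(column, ((String) value).toUpperCase());
  }
}
```

Because `run` records only after `execute` returns, a directive that fails on its input never appears in the lineage. With parse-time recording (Approach #2), the same validation would have to be repeated separately.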
API changes
New Programmatic APIs
DirectivesLineage: a Java class that contains all the information to be sent to the CDAP platform.
An instance should be initialized in the Wrangler step, and its parse() method should be called alongside each step's execute function.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DirectivesLineage {
      private final long startTime;
      private final String programName;
      private int numberOfColumns;
      private String dataSetName;                    // Dataset or Stream name/id
      private Map<String, ColumnDirectives> lineage; // column name -> directives applied to it

      public DirectivesLineage(String dataSetName, String[] columnNames) {
        this.startTime = System.currentTimeMillis();
        this.programName = "wrangler";
        this.dataSetName = dataSetName;
        this.numberOfColumns = columnNames.length;
        this.lineage = new HashMap<>(columnNames.length);
        for (int i = 0; i < columnNames.length; ++i) {
          lineage.put(columnNames[i], new ColumnDirectives(columnNames[i], i));
        }
      }

      // getters for startTime, programName, dataSetName
      // setter for dataSetName
      // helper functions for parse

      public void parse(Step currStep) {...}

      private class ColumnDirectives {
        final String originalName;                // If null it is a new column, for linking in database
        int colNum;                               // negative if not in workspace
        int version = NEW_DIR;                    // for making copies (NEW_DIR is not defined in this excerpt)
        List<MetaStep> steps = new ArrayList<>(); // AST that stores directives for this column

        // constructors
      }

      private class MetaStep {
        Step directive;
        List<String> colNames; // name of column
        List<String> cdNames;  // name of column with version

        // constructors
      }
    }
New REST APIs
Path | Method | Description | Response |
---|---|---|---|
/v3/namespaces/{namespace-id}/datasets/{dataset-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels> | GET | Returns the list of directives applied to the specified column in the specified dataset | 200: Success. Response body TBD, but it will contain a tree representation |
/v3/namespaces/{namespace-id}/streams/{stream-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels> | GET | Returns the list of directives applied to the specified column in the specified stream | 200: Success. Response body TBD, but it will contain a tree representation |
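As an illustration of how a client might assemble these request paths, here is a small sketch. `LineageUrls` and its methods are hypothetical helpers, not part of any existing client library. It assumes the ids are already URL-safe and treats the timestamps as plain numbers, since their unit and format are not specified here.

```java
// Hypothetical helper that builds the proposed column-lineage request paths.
final class LineageUrls {
  private LineageUrls() {}

  // Path for a column in a dataset.
  static String datasetColumnLineage(String namespaceId, String datasetId, String columnId,
                                     long startTs, long endTs, int maxLevels) {
    return columnLineage(namespaceId, "datasets", datasetId, columnId, startTs, endTs, maxLevels);
  }

  // Path for a column in a stream.
  static String streamColumnLineage(String namespaceId, String streamId, String columnId,
                                    long startTs, long endTs, int maxLevels) {
    return columnLineage(namespaceId, "streams", streamId, columnId, startTs, endTs, maxLevels);
  }

  private static String columnLineage(String namespaceId, String entityType, String entityId,
                                      String columnId, long startTs, long endTs, int maxLevels) {
    // Basic sanity checks; the real endpoint may apply different rules.
    if (endTs < startTs) {
      throw new IllegalArgumentException("end must not be before start");
    }
    if (maxLevels < 0) {
      throw new IllegalArgumentException("maxLevels must not be negative");
    }
    return "/v3/namespaces/" + namespaceId
        + "/" + entityType + "/" + entityId
        + "/columns/" + columnId
        + "/lineage?start=" + startTs
        + "&end=" + endTs
        + "&maxLevels=" + maxLevels;
  }
}
```

For example, `LineageUrls.datasetColumnLineage("default", "purchases", "price", 100L, 200L, 3)` yields `/v3/namespaces/default/datasets/purchases/columns/price/lineage?start=100&end=200&maxLevels=3`.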
CLI Impact or Changes
TBD
UI Impact or Changes
- Add an interface to the metadata table shown when viewing a dataset so that users can see the lineage of its columns, possibly by clicking on a column.
Security Impact
Expected to be none; TBD.
Impact on Infrastructure Outages
Storage in HBase; Impact TBD.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
1 | Test every getColumns() implementation | Every Step subclass returns the correct columns |
2 | Multiple datasets/streams | Lineage is shown correctly across different datasets/streams |
3 | Test DirectivesLineage.parse() with every step type | DirectivesLineage.parse() parses each step correctly |
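Test 1 could take a table-driven form along the following lines. This is a sketch only: `ColumnAwareStep`, `DropStep`, `RenameStep`, and `GetColumnsCheck` are hypothetical, and the assumption that a rename reports both the old and the new column name is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical minimal contract: each step reports the columns it touches.
interface ColumnAwareStep {
  List<String> getColumns();
}

// Example step that drops a single column.
final class DropStep implements ColumnAwareStep {
  private final String column;

  DropStep(String column) {
    this.column = column;
  }

  public List<String> getColumns() {
    return Collections.singletonList(column);
  }
}

// Example step that renames a column; assumed to report both the old and new name.
final class RenameStep implements ColumnAwareStep {
  private final String from;
  private final String to;

  RenameStep(String from, String to) {
    this.from = from;
    this.to = to;
  }

  public List<String> getColumns() {
    return Arrays.asList(from, to);
  }
}

final class GetColumnsCheck {
  private GetColumnsCheck() {}

  // Returns the names of the cases whose getColumns() result differs from the expectation.
  static List<String> failures(Map<String, ColumnAwareStep> steps,
                               Map<String, List<String>> expected) {
    List<String> failed = new ArrayList<>();
    for (Map.Entry<String, ColumnAwareStep> entry : steps.entrySet()) {
      List<String> actual = entry.getValue().getColumns();
      if (!actual.equals(expected.get(entry.getKey()))) {
        failed.add(entry.getKey());
      }
    }
    return failed;
  }
}
```

Each new Step subclass then needs only one extra entry in the two maps, and an empty result from `failures` means every step reported the expected columns.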
Releases
Release 4.3.0
Release 4.4.0
Related Work
- Fixing the switch statement in TextDirectives