Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Overview 

This addition will allow users to see the history of directives applied to a column of data.

Goals

Users should be able to see lineage information, i.e., the directives applied to a column, for each column.

Storing lineage information should have minimal or no impact on the Wrangler application.

User Stories

  • As a user, I should be able to see the directives applied to a column of data.

  • As a user, I should be able to see the directives applied to a column of data over any period of time.

  • As a user, I should be able to add tags and properties to specific columns of data (stretch)

Design

Save directives for each column in AST format during execution of wrangler along with necessary information (time, dataset/stream name/id).

Use TMS to send information to platform.

Unmarshal and store in HBase.

Access to lineage should be available only through the platform.
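The design does not yet fix a wire format for the information sent to the platform. As a minimal sketch only, the marshal/unmarshal step could look like the following. The class `LineageRecord`, its fields, and the binary layout are assumptions made for illustration; the sketch deliberately contains no TMS or HBase calls, since those APIs are outside what this page specifies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record: one column's directives plus the "necessary information"
// named in the design (time, dataset/stream name).
final class LineageRecord {
  final long time;
  final String dataSetName;
  final String columnName;
  final List<String> directives;

  LineageRecord(long time, String dataSetName, String columnName, List<String> directives) {
    this.time = time;
    this.dataSetName = dataSetName;
    this.columnName = columnName;
    this.directives = new ArrayList<>(directives);
  }

  // Marshal the record into a byte payload (assumed layout, not a platform format).
  byte[] marshal() {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeLong(time);
      out.writeUTF(dataSetName);
      out.writeUTF(columnName);
      out.writeInt(directives.size());
      for (String directive : directives) {
        out.writeUTF(directive);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  // Unmarshal a payload produced by marshal(), reading fields in the same order.
  static LineageRecord unmarshal(byte[] payload) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
      long time = in.readLong();
      String dataSetName = in.readUTF();
      String columnName = in.readUTF();
      int count = in.readInt();
      List<String> directives = new ArrayList<>(count);
      for (int i = 0; i < count; i++) {
        directives.add(in.readUTF());
      }
      return new LineageRecord(time, dataSetName, columnName, directives);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

On the Wrangler side the payload from `marshal()` would be handed to the messaging layer; on the platform side `unmarshal()` would run before the record is stored.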

Approach

Approach #1 (Preferred):

Store directives in DirectivesLineage case by case, with one case per directive

Advantages: Less overhead. Less information transferred.

Disadvantages: Switch statement. Any new directive added will require a change to the DirectivesLineage class
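To make the trade-off concrete, here is a rough sketch of what the switch-based approach could look like. The class `SwitchLineage`, the string-based `parse` signature, and the three directive names handled are illustrative assumptions, not the actual DirectivesLineage design.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for DirectivesLineage under Approach #1.
final class SwitchLineage {
  // column name -> directives recorded for that column, in order
  private final Map<String, List<String>> lineage = new LinkedHashMap<>();

  // One case per directive: every new directive needs a new case here.
  void parse(String directive, String... args) {
    switch (directive) {
      case "uppercase":
      case "drop":
        record(args[0], directive + " " + args[0]);
        break;
      case "rename": {
        // Move the history from the old column name to the new one.
        List<String> history = lineage.remove(args[0]);
        if (history == null) {
          history = new ArrayList<>();
        }
        history.add("rename " + args[0] + " " + args[1]);
        lineage.put(args[1], history);
        break;
      }
      default:
        throw new IllegalArgumentException("No lineage rule for directive: " + directive);
    }
  }

  private void record(String column, String entry) {
    lineage.computeIfAbsent(column, k -> new ArrayList<>()).add(entry);
  }

  List<String> history(String column) {
    return lineage.getOrDefault(column, Collections.emptyList());
  }
}
```

The `default` branch shows the stated disadvantage: a directive without a case cannot be tracked until this class is edited.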

Approach #2:

Store directives in DirectivesLineage by having each step send the columns it dropped, added, modified, read, or renamed

Advantages: No switch statement. Any new directive would not have to change DirectivesLineage class.

Disadvantages: Runtime might increase for certain steps. More overhead in each step class
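For comparison, Approach #2 could be sketched as follows, with each step reporting its own column effects. The names `ColumnAction`, `LineageStep`, `CopyStep`, and `ReportedLineage` are hypothetical and exist only for this illustration; the real Step interface may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// The five effects listed for Approach #2.
enum ColumnAction { DROPPED, ADDED, MODIFIED, READ, RENAMED }

// Hypothetical contract each step would implement.
interface LineageStep {
  String directive();                         // text of the directive
  Map<String, ColumnAction> columnActions();  // column name -> effect on it
}

// Example step: copies one column into a new one.
final class CopyStep implements LineageStep {
  private final String source;
  private final String destination;

  CopyStep(String source, String destination) {
    this.source = source;
    this.destination = destination;
  }

  @Override
  public String directive() {
    return "copy " + source + " " + destination;
  }

  @Override
  public Map<String, ColumnAction> columnActions() {
    Map<String, ColumnAction> actions = new LinkedHashMap<>();
    actions.put(source, ColumnAction.READ);
    actions.put(destination, ColumnAction.ADDED);
    return actions;
  }
}

// Lineage store with no per-directive knowledge: it only consumes reported actions.
final class ReportedLineage {
  private final Map<String, List<String>> lineage = new LinkedHashMap<>();

  void parse(LineageStep step) {
    for (Map.Entry<String, ColumnAction> entry : step.columnActions().entrySet()) {
      lineage.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
          .add(entry.getValue() + ": " + step.directive());
    }
  }

  List<String> history(String column) {
    return lineage.getOrDefault(column, Collections.emptyList());
  }
}
```

Adding a new directive here means writing only a new step class; the cost is the extra map each step builds on every call, which matches the overhead concern noted above.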

API changes

New Programmatic APIs

DirectivesLineage: a Java class that contains all the information to be sent to the CDAP platform.

An instance should be initialized in the Wrangler step, and each directive should be parsed as its execute function runs.

DirectivesLineage
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DirectivesLineage {
  private final long startTime;
  private final String programName;
  private int numberOfColumns;
  private String dataSetName; //Dataset or Stream name/id
  private Map<String, ColumnDirectives> lineage;

  public DirectivesLineage(String dataSetName, String[] columnNames) {
    this.startTime = System.currentTimeMillis();
    this.programName = "wrangler";
    this.dataSetName = dataSetName;
    this.numberOfColumns = columnNames.length;
    this.lineage = new HashMap<>(columnNames.length);
    for (int i = 0; i < columnNames.length; ++i)
      lineage.put(columnNames[i], new ColumnDirectives(columnNames[i], i));
  }

  // getters for startTime, programName, dataSetName
  // setter for dataSetName

  // helper functions for parse

  public void parse(Step currStep) {...}

  private class ColumnDirectives {
    final String originalName; // null for a new column; used for linking in the database
    int colNum; // negative if the column is not in the workspace
    int version = NEW_DIR; // for making copies; NEW_DIR is a constant not shown in this draft
    List<MetaStep> steps = new ArrayList<>();  // AST that stores directives for this column

    // constructors
  }

  private class MetaStep {
    Step directive;
    List<String> colNames; // name of column
    List<String> cdNames; // name of column with version

    // constructors
  }
}

 

New REST APIs

| Path | Method | Description | Response |
|------|--------|-------------|----------|
| /v3/namespaces/{namespace-id}/datasets/{dataset-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels> | GET | Returns list of directives applied to the specified column in the specified dataset | 200: Successful. Response TBD, but will contain a Tree representation |
| /v3/namespaces/{namespace-id}/streams/{stream-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels> | GET | Returns list of directives applied to the specified column in the specified stream | 200: Successful. Response TBD, but will contain a Tree representation |
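As an illustration of how a client might assemble these paths, the sketch below fills in the two templates. The helper class `LineagePaths` and its method names are hypothetical, the timestamps are assumed to be plain numeric values, and the IDs are assumed to be URL-safe, so no encoding is applied.

```java
import java.util.Locale;

// Hypothetical helper that fills in the two lineage path templates.
final class LineagePaths {
  private LineagePaths() {}

  static String datasetColumn(String namespaceId, String datasetId, String columnId,
                              long startTs, long endTs, int maxLevels) {
    return build("datasets", namespaceId, datasetId, columnId, startTs, endTs, maxLevels);
  }

  static String streamColumn(String namespaceId, String streamId, String columnId,
                             long startTs, long endTs, int maxLevels) {
    return build("streams", namespaceId, streamId, columnId, startTs, endTs, maxLevels);
  }

  // entityType is "datasets" or "streams"; Locale.ROOT keeps the digits ASCII.
  private static String build(String entityType, String namespaceId, String entityId,
                              String columnId, long startTs, long endTs, int maxLevels) {
    return String.format(Locale.ROOT,
        "/v3/namespaces/%s/%s/%s/columns/%s/lineage?start=%d&end=%d&maxLevels=%d",
        namespaceId, entityType, entityId, columnId, startTs, endTs, maxLevels);
  }
}
```

For example, `LineagePaths.datasetColumn("default", "customers", "name", 1500000000L, 1500003600L, 3)` yields the dataset path with those values substituted.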

CLI Impact or Changes

TBD

UI Impact or Changes

  • Add an interface to the metadata table shown when viewing a dataset so that users can see the lineage of a column, possibly by clicking on the column

Security Impact 

Should be none, TBD

Impact on Infrastructure Outages 

Storage in HBase; Impact TBD.

Test Scenarios

| Test ID | Test Description | Expected Results |
|---------|------------------|------------------|
| 1 | Tests all getColumns() | All Step subclasses always properly return correct columns |
| 2 | Multiple datasets/streams | Lineages are correctly shown between different datasets/streams |
| 3 | Tests all parse() | DirectivesLineage.parse() always correctly parses step |

Releases

Release 4.3.0

Release 4.4.0

Related Work

  • Fixing Switch statement in TextDirectives