What is Cask Tracker?

Cask Tracker (formerly Cask Finder) is a CDAP extension that tracks data ingested either through Cask Hydrator or a custom CDAP application and provides input to data governance processes on a cluster.

It includes the following data about "the data":

  • Metadata
    • Includes tags, properties, and schema for CDAP datasets and programs
    • Covers both system and user metadata
  • Data Quality
    • Metadata that includes feed-level and field-level quality metrics of datasets
  • Data Usage Statistics
    • Usage statistics of datasets and programs

Checklist

  •  User stories documented (Nitin)
  •  User stories reviewed (Nitin)
  •  Design documented (Shankar)
  •  Design reviewed (Terence/Andreas/Albert)
  •  Feature merged (Shankar)
  •  UI updated (Ajai/Edwin)
  •  Documentation for feature (Shankar)

Use-cases

The following are the use cases for Cask Tracker.

Case #1

  • Rishab is a data scientist/engineer at a company that implements a Data Lake. He is analyzing the effectiveness of the recommendation engine on the company's e-commerce site. For this investigation, he wants to analyze a dataset that includes click logs for the last year. He is looking for clean click log data that is up-to-date. He wants to use part of the data to build a model and the rest to score the model and validate its predictions.
  • Before he can conduct an analysis, Rishab needs to confirm the dataset is available in the Data Lake.
  • To do so, he wishes to find all entities that include “click log”.
  • He arrives at the Finder home screen (from nav, search results, other entry points?).
    • Enters “click log” in the Search Box and clicks Search.
    • He arrives at the Results Page. 
      • Results returned
      • By default, they are sorted by creation time
      • Each Result includes:
        • Snippet of the metadata that matches his query in context.
          • Important to help him evaluate the relevance of the results.
        • Date Created
          • To know how recent/new it is.
  • For this analysis, Rishab is most concerned with the recency, the accuracy, and the integrity of the data.
  • He clicks the result and arrives at the Entity Detail Page where he can view all of the metadata associated with an entity. 
  • Rishab wishes to verify the validity of the sources of this dataset. To do so, he clicks the Lineage Tab to trace the creation of this dataset to its source.
  • Finder displays the lineage for this dataset as a diagram. The selected dataset displays in the center; to the left is the entity that precedes it and to the right is the one it precedes.
  • Rishab discovers that the dataset has been created from two separate sources.
  • He then clicks one of the sources, which takes him to the Entity Page of that dataset.
  • He clicks on a program to see what has been done to the dataset.
  • Rishab clicks the Audit Logs Tab to see how active this dataset has been: when it was last updated and who is using it, writing to it, or reading from it.
  • Rishab clicks the appropriate action to make this dataset a new source for his existing Click Log processing pipeline.
  • This takes him to the Hydrator Studio where he can edit the Master Click Log pipeline.

Case #2 (Incomplete)

  • D-Rock, an IT genius, has been tasked with finding all the datasets on the cluster that have the field SSN.
  • He needs to find every dataset with an SSN field and mask the field by generating a new feed containing a masked SSN.
  • To do this, he uses Cask Finder:
    • Enters “SSN” in the Search Box and clicks Search.
    • He then filters by Schema, meaning he wants to see only datasets where the SSN field is matched in the schema.
  • He then selects the dataset that has this field and ...

Design

Release 1.0 Deliverables

Tracker API Meeting Notes: 2016-03-01

Initial design for audit log API

Endpoint:

  • GET /auditlog/<dataset>?starttime, endtime
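As a rough illustration of how a client might assemble this request, here is a minimal Python sketch. Only the /auditlog/<dataset> path and the starttime and endtime parameter names come from the notes; the base URL, the helper name audit_log_url, and the use of epoch-second values are assumptions made for the example.

```python
from urllib.parse import quote, urlencode


def audit_log_url(base, dataset, starttime, endtime):
    """Build the audit log URL for one dataset.

    `base` is a hypothetical service address; the notes do not specify one.
    """
    # Percent-encode the dataset name so spaces or slashes cannot break the path.
    path = f"{base.rstrip('/')}/auditlog/{quote(dataset, safe='')}"
    # starttime and endtime are sent as query parameters, as in the endpoint above.
    query = urlencode({"starttime": starttime, "endtime": endtime})
    return f"{path}?{query}"


# Hypothetical host and epoch-second values, for illustration only.
url = audit_log_url("http://tracker.example.com", "A", "1456790400", "1456876800")
```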

MUST support:

  • relative dates (1d, 1h, 1m, alltime)
  • pagination
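The two requirements above could be handled along the following lines. This is a sketch under stated assumptions, not the Tracker implementation: the notes do not define the units, so "m" is treated as minutes here, and the offset/limit pagination scheme and the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit meanings; the notes list 1d, 1h, 1m without defining them.
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def resolve_start(spec, now):
    """Turn a relative spec such as '1d' into an absolute start time.

    Returns None for 'alltime', meaning no lower bound.
    """
    if spec == "alltime":
        return None
    match = re.fullmatch(r"(\d+)([dhm])", spec)
    if match is None:
        raise ValueError(f"unsupported relative date: {spec!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(**{_UNITS[unit]: amount})


def paginate(records, offset, limit):
    """Return one page of records plus the total count (hypothetical scheme)."""
    return {"total": len(records), "results": records[offset:offset + limit]}


now = datetime(2016, 3, 1, 12, 0, tzinfo=timezone.utc)
one_day_ago = resolve_start("1d", now)
page = paginate(list(range(5)), offset=2, limit=2)
```

Returning the total alongside each page lets a client compute how many pages remain without a second request.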

MUST return:

  • total results
  • timestamp
  • user
  • action type
  • kind of entity
  • name of entity
  • entity metadata

Example:

MapReduce job "X" accesses dataset "A" for reading:

timestamp: t0
user: default
action type: READ
kind of entity: MAPREDUCE
name of entity: X
entity metadata: { runid: ... }
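One possible shape for such a record and the enclosing response is sketched below in Python. The field values mirror the example above; the class name, the key names, and the response envelope are assumptions, since the notes list what must be returned but not how it is serialized. The run id is left elided, as in the notes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuditRecord:
    """One audit log entry; attribute names are hypothetical."""
    timestamp: str
    user: str
    action_type: str
    entity_kind: str
    entity_name: str
    entity_meta: dict = field(default_factory=dict)


def build_response(records):
    """Wrap records with the total result count required by the notes."""
    return {"totalResults": len(records), "results": [asdict(r) for r in records]}


# MapReduce job "X" reads dataset "A"; "t0" and "..." are placeholders from the notes.
record = AuditRecord(
    timestamp="t0",
    user="default",
    action_type="READ",
    entity_kind="MAPREDUCE",
    entity_name="X",
    entity_meta={"runid": "..."},
)
response = build_response([record])
body = json.dumps(response)
```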