Introduction

CDAP currently captures lineage at the dataset level. With lineage, users can tell the program that read from or wrote to a dataset. It can help users determine which program wrote to/read from a dataset in a given timeframe. They can keep drilling into either the upstream or the downstream direction.

However, as a platform, CDAP understands schemas for most datasets. Schemas contain fields. It would be useful to be able to drill into how a field in a particular dataset was used(CREATE/READ/WRITE/DELETE) in a given time period.

Goals

Provide CDAP platform support (in the form of API and storage) to track field level lineage.
Pipelines can then expose this functionality to the plugins.
Plugins (such as wrangler) will need to be updated to use this feature.

User Stories

Breakdown of User-Stories

User Story #1

User Story #2

User Story #3

Id

User Story

FLL-1

As a data governance reviewer or information architect at a financial institution,

I would like to generate a report of how a PII field UID from the dataset DailyTransactions was consumed in the specified time period

so that

UID is PII data, and I would like to know which processes accessed it in the given time period
If the data is breached/compromised or improperly generated, I can understand its impact, and take appropriate remedial action
I can also understand which fields may have been generated from the UID field in downstream processes, and judge the impact on those fields to take steps towards remediation
I can generate compliance reports to certify that my organization does not breach contracts with third-party data providers. My licenses with third party providers require me to enforce strict retention policies on their data and any generated downstream data. I would like to ensure that those retention policies are adhered to at all times.

FLL-2

As a data scientist at a healthcare organization,

I would like to trace the provenance of the field patient_medical_score in the dataset PatientRecords over the last month

so that

I can understand the true source(s) from which the patient_medical_score was generated
I can understand the operations that were performed on various source fields to generate the patient_medical_score
I can use this information to determine if patient_medical_score is a suitable field that can be relied upon for generating my ML model.

FLL-3

As the developer of a plugin that defines some transformations on the source in a pipeline,

I would like to be able to register that I performed a particular kind of operation (READ/WRITE) on an input field which generated an output field at a given instant

so that

Users can view lineage of the fields in their data in future.

Design

API to build the operations to be tracked at the field level.

Code Block

language	java

/**
 * Following enum is already available in the CDAP which can be made part of the cdap-api. 
 * This type can be used to track the field level lineage as well.
 */
public enum AccessType {
	READ ('r'),
	WRITE ('w'),
	READ_WRITE ('a'),
	UNKNOWN('u');
}
 
/**
 * Following class can be used in the program to provide the field level lineage information.
 */
public class FieldAccess {
	private final Field field; // Represents the field with associated schema
	private final AccessType type; // Represents access to be performed on the field
}

Option #1: Add new interface which will allow programs to record the field level lineage.

Code Block

language	java

/**
 * This interface provides methods that will allow programs to record the lineage at the field level.
 */
public interface LineageRecorder {
    /**
 	 * Record the lineage for the given dataset.
 	 *
 	 * @param datasetName The name of the Dataset
 	 * @param fieldAccesses The set of field level accesses that will be performed on the dataset
 	*/
    void recordLineage(String datasetName, Set<FieldAccess> fieldAccesses);
 
    /**
 	 * Record the lineage for the given dataset in a given namespace.
 	 *
 	 * @param namespace The name of the namespace
 	 * @param datasetName The name of the Dataset
 	 * @param fieldAccesses The set of field level accesses that will be performed on the dataset in the given namespace
 	*/
 	void recordLineage(String namespace, String datasetName, Set<FieldAccess> fieldAccesses);
}

DatasetContext interface can then extend the LineageRecorder interface so that programs can record the field level lineage.

Pros:

Keep the DatasetContext simpler and only adds two additional methods for recording lineage.

Cons:

Additional method calls are required by user for recording the field level lineage.

Option #2: Add new versions of the getDataset methods which will allow users to specify the set of field operations.

Code Block

language	java

/**
 * Get an instance of the specified Dataset.
 *
 * @param name The name of the Dataset
 * @param fieldAccesses The field accesses that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String name, Set<FieldAccess> fieldAccesses) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param namespace The namespace of the Dataset
 * @param name The name of the Dataset
 * @param fieldAccesses The field accesses that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String namespace, String name, Set<FieldAccess> fieldAccesses) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param name The name of the Dataset
 * @param arguments the arguments for this dataset instance
 * @param fieldAccesses The field accesses that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String name, Map<String, String> arguments, Set<FieldAccess> fieldAccesses) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param namespace The namespace of Dataset
 * @param name The name of the Dataset
 * @param arguments the arguments for this dataset instance
 * @param fieldAccesses The field accesses that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String namespace, String name, Map<String, String> arguments, Set<FieldAccess> fieldAccesses)
  throws DatasetInstantiationException;

Pros:

No need of separate methods to record the lineage. With additional parameter, getDataset method itself can be used to record the lineage information.

Cons:

DatasetContext already has many different versions of the getDataset method which accepts different parameters. Addition of more methods with the new parameters can be confusing to the user.

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path

Method

Description

Response Code

Response

/v3/apps/<app-id>

GET

Returns the application spec for a given application

200 - On success

404 - When application is not available

500 - Any internal errors

Deprecated REST API

Path	Method	Description
/v3/apps/<app-id>	GET	Returns the application spec for a given application

CLI Impact or Changes

Impact #1
Impact #2
Impact #3

UI Impact or Changes

Impact #1
Impact #2
Impact #3

Security Impact

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test ID	Test Description	Expected Results

Versions Compared

Old Version 6

New Version 7

Key

Table of Contents

Introduction

Goals

User Stories

Design

Option #1: Add new interface which will allow programs to record the field level lineage.

Option #2: Add new versions of the getDataset methods which will allow users to specify the set of field operations.

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

Deprecated Programmatic APIs

New REST APIs

Deprecated REST API

CLI Impact or Changes

UI Impact or Changes

Security Impact

Impact on Infrastructure Outages

Test Scenarios

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

Future work

Page Comparison

Versions Compared

Old Version 6

New Version 7

Key

Table of Contents

Introduction

Goals

User Stories

Design

Option #1: Add new interface which will allow programs to record the field level lineage.

Option #2: Add new versions of the getDataset methods which will allow users to specify the set of field operations.

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

Deprecated Programmatic APIs

New REST APIs

Deprecated REST API

CLI Impact or Changes

UI Impact or Changes

Security Impact

Impact on Infrastructure Outages

Test Scenarios

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

Future work