Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 8 Next »

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

CDAP currently captures lineage at the dataset level. With lineage, users can tell the program that read from or wrote to a dataset. It can help users determine which program wrote to/read from a dataset in a given timeframe. They can keep drilling into either the upstream or the downstream direction.

However, as a platform, CDAP understands schemas for most datasets. Schemas contain fields. It would be useful to be able to drill into how a field in a particular dataset was used(CREATE/READ/WRITE/DELETE) in a given time period.

Goals

  • Provide CDAP platform support (in the form of API and storage) to track field level lineage.
  • Pipelines can then expose this functionality to the plugins.
  • Plugins (such as wrangler) will need to be updated to use this feature.

User Stories 

Id
User Story
FLL-1

As a data governance reviewer or information architect at a financial institution,

I would like to generate a report of how a PII field UID from the dataset DailyTransactions was consumed in the specified time period

so that

  1. UID is PII data, and I would like to know which processes accessed it in the given time period
  2. If the data is breached/compromised or improperly generated, I can understand its impact, and take appropriate remedial action
  3. I can also understand which fields may have been generated from the UID field in downstream processes, and judge the impact on those fields to take steps towards remediation
  4. I can generate compliance reports to certify that my organization does not breach contracts with third-party data providers. My licenses with third party providers require me to enforce strict retention policies on their data and any generated downstream data. I would like to ensure that those retention policies are adhered to at all times.
FLL-2

As a data scientist at a healthcare organization,

I would like to trace the provenance of the field patient_medical_score in the dataset PatientRecords over the last month

so that

  1. I can understand the true source(s) from which the patient_medical_score was generated
  2. I can understand the operations that were performed on various source fields to generate the patient_medical_score
  3. I can use this information to determine if patient_medical_score is a suitable field that can be relied upon for generating my ML model.
FLL-3

As the developer of a plugin that defines some transformations on the source in a pipeline,

I would like to be able to register that I performed a particular kind of operation (READ/WRITE) on an input field which generated an output field at a given instant

so that

  1. Users can view lineage of the fields in their data in future.

Design

API to build the operations to be tracked at the field level.

/**
 * Following enum is already available in the CDAP which can be made part of the cdap-api. 
 * This type can be used to track the field level lineage as well.
 */
public enum AccessType {
	READ ('r'),
	WRITE ('w'),
	READ_WRITE ('a'),
	UNKNOWN('u');
}
 
/**
 * Following class can be used in the program to provide the field level business metadata which includes tags, properties, and lineage information.
 */
public class FieldMetadata {
	private final Field field; // Represents the field with associated schema
	private final AccessType type; // Represents access to be performed on the field which will be used for tracking the lineage
    private final Set<String> tags; // Tags associated with the field
    private final Map<String, String> properties; // Additional metadata properties associated with the field
}

Option #1: Add new interface which will allow programs to record the field level lineage.

 

/**
 * This interface provides methods that will allow programs to record the metadata at the field level.
 */
public interface FieldMetadataRecorder {
    /**
 	 * Record the field level metadata for the given dataset.
 	 *
 	 * @param datasetName The name of the Dataset
 	 * @param fieldMetadata The set of field level metadata information
 	*/
    void record(String datasetName, Set<FieldMetadata> fieldMetadata);
 
    /**
 	 * Record the field level metadata for the given dataset in a given namespace.
 	 *
 	 * @param namespace The name of the namespace
 	 * @param datasetName The name of the Dataset
 	 * @param fieldMetadata The set of field level metadata information
 	*/
 	void record(String namespace, String datasetName, Set<FieldMetadata> fieldMetadata);
}

 

DatasetContext interface can then extend the FieldMetadataRecorder interface so that programs can record the field level metadata information.

Pros:

  1. Keep the DatasetContext simpler and only adds two additional methods for recording metadata.

Cons:

  1. Additional method calls are required by user for recording the field level metadata information.

Option #2: Add new versions of the getDataset methods which will allow users to specify the set of field metadata information.

 

/**
 * Get an instance of the specified Dataset.
 *
 * @param name The name of the Dataset
 * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String name, Set<FieldMetadata> fieldMetadata) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param namespace The namespace of the Dataset
 * @param name The name of the Dataset
 * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String namespace, String name, Set<FieldMetadata> fieldMetadata) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param name The name of the Dataset
 * @param arguments the arguments for this dataset instance
 * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String name, Map<String, String> arguments, Set<FieldMetadata> fieldMetadata) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param namespace The namespace of Dataset
 * @param name The name of the Dataset
 * @param arguments the arguments for this dataset instance
 * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String namespace, String name, Map<String, String> arguments, Set<FieldMetadata> fieldMetadata)
  throws DatasetInstantiationException;


Pros:

  1. No need of separate methods to record the lineage. With additional parameter, getDataset method itself can be used to record the lineage information.

Cons:

  1. DatasetContext already has many different versions of the getDataset method which accepts different parameters. Addition of more methods with the new parameters can be confusing to the user.

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

PathMethodDescriptionResponse CodeResponse
/v3/apps/<app-id>GETReturns the application spec for a given application

200 - On success

404 - When application is not available

500 - Any internal errors

 

     

Deprecated REST API

PathMethodDescription
/v3/apps/<app-id>GETReturns the application spec for a given application

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages 

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test IDTest DescriptionExpected Results
   
   
   
   

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3

 

Future work

  • No labels