Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
CDAP currently captures lineage at the dataset level. Dataset level lineage tells users which programs read from or wrote to a dataset in a given time frame.
However, as a platform, CDAP understands schemas for most datasets. Schemas contain fields. It would be useful to be able to drill into how a field in a particular dataset was used (CREATE/READ/WRITE/DELETE) in a given time period.
Goals
- Provide CDAP platform support (in the form of API and storage) to track field level lineage.
- Pipelines can then expose this functionality to the plugins.
- Plugins (such as wrangler) will need to be updated to use this feature.
Use Cases
Id | Use Case |
---|---|
FLL-1 | As a data governance reviewer or information architect at a financial institution, I would like to generate a report of how a PII field UID from the dataset DailyTransactions was consumed in the specified time period so that |
FLL-2 | As a data scientist at a healthcare organization, I would like to trace the provenance of the field patient_medical_score in the dataset PatientRecords over the last month so that |
User Stories
- A Spark program can perform various transformations on the input fields of a dataset to generate new fields. For example, it can concatenate the first_name and last_name fields of the input dataset so that the resultant dataset has only a Name field. As a developer of a CDAP program (for example, a CDAP Spark program), I should be able to provide these transformations so that I can later find out how the field Name was generated.
- Similar transformations on fields can be done in CDAP plugins as well. A plugin developer should be able to provide such transformations.
- A few plugins, such as the JavaScript transform and the Python transform, execute custom code provided by the pipeline developer. In this case, the pipeline developer should be able to provide the field transformations through the plugin config UI.
Design
API to build the operations to be tracked at the field level.
/**
 * Following enum is already available in CDAP and can be made part of cdap-api.
 * This type can be used to track the field level lineage as well.
 */
public enum AccessType {
  READ('r'),
  WRITE('w'),
  READ_WRITE('a'),
  UNKNOWN('u');

  private final char type;

  AccessType(char type) {
    this.type = type;
  }
}

/**
 * Following class can be used in the program to provide the field level business metadata,
 * which includes tags, properties, and lineage information.
 */
public class FieldMetadata {
  // Represents the field with associated schema
  private final Field field;
  // Represents access to be performed on the field which will be used for tracking the lineage
  private final AccessType type;
  // Tags associated with the field
  private final Set<String> tags;
  // Additional metadata properties associated with the field
  private final Map<String, String> properties;

  // Constructor and getters omitted
}
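As an illustration of how a program might describe the name concatenation user story with the FieldMetadata class above, here is a minimal sketch. It makes several assumptions that are not part of the design: the field is identified by a plain String name instead of a schema Field, the constructor and getters are invented for the example, and the `derivedFrom` property key is a hypothetical convention for noting which input fields produced an output field.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Mirrors the AccessType enum from the design.
enum AccessType {
  READ('r'), WRITE('w'), READ_WRITE('a'), UNKNOWN('u');

  private final char type;

  AccessType(char type) {
    this.type = type;
  }

  char getType() {
    return type;
  }
}

// Simplified stand-in for the proposed FieldMetadata (assumption: the field
// is identified by name only; constructor and getters are hypothetical).
final class FieldMetadata {
  private final String fieldName;
  private final AccessType type;
  private final Set<String> tags;
  private final Map<String, String> properties;

  FieldMetadata(String fieldName, AccessType type,
                Set<String> tags, Map<String, String> properties) {
    this.fieldName = fieldName;
    this.type = type;
    // Defensive copies keep the metadata immutable after construction.
    this.tags = Collections.unmodifiableSet(new HashSet<>(tags));
    this.properties = Collections.unmodifiableMap(new HashMap<>(properties));
  }

  String getFieldName() { return fieldName; }
  AccessType getType() { return type; }
  Set<String> getTags() { return tags; }
  Map<String, String> getProperties() { return properties; }
}

class NameConcatenationExample {
  // Fields the program reads from the input dataset.
  static Set<FieldMetadata> inputFields() {
    Set<FieldMetadata> fields = new LinkedHashSet<>();
    fields.add(new FieldMetadata("first_name", AccessType.READ,
                                 Collections.emptySet(), Collections.emptyMap()));
    fields.add(new FieldMetadata("last_name", AccessType.READ,
                                 Collections.emptySet(), Collections.emptyMap()));
    return fields;
  }

  // Field the program writes to the output dataset.
  static Set<FieldMetadata> outputFields() {
    Map<String, String> properties = new HashMap<>();
    // Hypothetical property key noting the source fields of "Name".
    properties.put("derivedFrom", "first_name,last_name");
    return Collections.singleton(
      new FieldMetadata("Name", AccessType.WRITE, Collections.emptySet(), properties));
  }

  public static void main(String[] args) {
    for (FieldMetadata metadata : inputFields()) {
      System.out.println(metadata.getFieldName() + " " + metadata.getType());
    }
    for (FieldMetadata metadata : outputFields()) {
      System.out.println(metadata.getFieldName() + " " + metadata.getType()
                           + " " + metadata.getProperties());
    }
  }
}
```

Note that the class as proposed carries only an access type per field, so the link between input and output fields has to be carried somewhere else; the properties map is used here purely as a placeholder for that.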
Option #1: Add a new interface that allows programs to record field level lineage.
/**
 * This interface provides methods that will allow programs to record the metadata at the field level.
 */
public interface FieldMetadataRecorder {

  /**
   * Record the field level metadata for the given dataset.
   *
   * @param datasetName The name of the Dataset
   * @param fieldMetadata The set of field level metadata information
   */
  void record(String datasetName, Set<FieldMetadata> fieldMetadata);

  /**
   * Record the field level metadata for the given dataset in a given namespace.
   *
   * @param namespace The name of the namespace
   * @param datasetName The name of the Dataset
   * @param fieldMetadata The set of field level metadata information
   */
  void record(String namespace, String datasetName, Set<FieldMetadata> fieldMetadata);
}
The DatasetContext interface can then extend the FieldMetadataRecorder interface so that programs can record the field level metadata information.
Pros:
- Keeps DatasetContext simple: only two additional methods are needed for recording metadata.
Cons:
- Users must make additional method calls to record the field level metadata information.
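To make the trade-off in Option #1 concrete, the following sketch shows what the extra recording calls might look like from a program. It is not the CDAP implementation: `InMemoryFieldMetadataRecorder`, the reduced `FieldMetadata` (field name and access type only), the string entry format, and the `reports` namespace and `UidReport` dataset are all hypothetical stand-ins chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

enum AccessType { READ, WRITE, READ_WRITE, UNKNOWN }

// Reduced stand-in for FieldMetadata: field name and access type only.
final class FieldMetadata {
  final String fieldName;
  final AccessType type;

  FieldMetadata(String fieldName, AccessType type) {
    this.fieldName = fieldName;
    this.type = type;
  }
}

// Same two methods as the interface proposed in Option #1.
interface FieldMetadataRecorder {
  void record(String datasetName, Set<FieldMetadata> fieldMetadata);

  void record(String namespace, String datasetName, Set<FieldMetadata> fieldMetadata);
}

// Hypothetical recorder that keeps entries in memory instead of a metadata store.
final class InMemoryFieldMetadataRecorder implements FieldMetadataRecorder {
  private final String defaultNamespace;
  private final List<String> entries = new ArrayList<>();

  InMemoryFieldMetadataRecorder(String defaultNamespace) {
    this.defaultNamespace = defaultNamespace;
  }

  @Override
  public void record(String datasetName, Set<FieldMetadata> fieldMetadata) {
    // The single-namespace variant delegates to the namespaced one.
    record(defaultNamespace, datasetName, fieldMetadata);
  }

  @Override
  public void record(String namespace, String datasetName, Set<FieldMetadata> fieldMetadata) {
    for (FieldMetadata metadata : fieldMetadata) {
      entries.add(namespace + "." + datasetName + "." + metadata.fieldName + ":" + metadata.type);
    }
  }

  List<String> getEntries() {
    return new ArrayList<>(entries);
  }
}

class RecorderExample {
  static List<String> run() {
    InMemoryFieldMetadataRecorder recorder = new InMemoryFieldMetadataRecorder("default");
    // The program reads UID from the input dataset ...
    recorder.record("DailyTransactions",
                    Collections.singleton(new FieldMetadata("UID", AccessType.READ)));
    // ... and writes it to a dataset in another namespace.
    recorder.record("reports", "UidReport",
                    Collections.singleton(new FieldMetadata("UID", AccessType.WRITE)));
    return recorder.getEntries();
  }

  public static void main(String[] args) {
    for (String entry : run()) {
      System.out.println(entry);
    }
  }
}
```

The sketch shows the cost named under Cons: every dataset access that should appear in lineage needs its own `record` call in addition to the call that obtains the dataset.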
Option #2: Add new versions of the getDataset methods which will allow users to specify the set of field metadata information.
/**
 * Get an instance of the specified Dataset.
 *
 * @param name The name of the Dataset
 * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String name, Set<FieldMetadata> fieldMetadata)
  throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param namespace The namespace of the Dataset
 * @param name The name of the Dataset
 * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String namespace, String name, Set<FieldMetadata> fieldMetadata)
  throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param name The name of the Dataset
 * @param arguments the arguments for this dataset instance
 * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String name, Map<String, String> arguments, Set<FieldMetadata> fieldMetadata)
  throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param namespace The namespace of Dataset
 * @param name The name of the Dataset
 * @param arguments the arguments for this dataset instance
 * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String namespace, String name, Map<String, String> arguments,
                                 Set<FieldMetadata> fieldMetadata)
  throws DatasetInstantiationException;
Pros:
- No separate methods are needed to record lineage. With the additional parameter, the getDataset method itself can be used to record the lineage information.
Cons:
- DatasetContext already has many versions of the getDataset method that accept different parameters. Adding more overloads with the new parameter can confuse users.
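For comparison, Option #2 could look roughly like this from the caller's side. This is a sketch under stated assumptions rather than the CDAP API: `LineageAwareDatasetContext` is a hypothetical interface carrying only the simplest of the four proposed overloads, `Dataset` and `KeyValueDataset` are empty stand-ins, lineage is kept as strings in memory, and an `IllegalArgumentException` takes the place of DatasetInstantiationException.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

enum AccessType { READ, WRITE, READ_WRITE, UNKNOWN }

// Reduced stand-in for FieldMetadata: field name and access type only.
final class FieldMetadata {
  final String fieldName;
  final AccessType type;

  FieldMetadata(String fieldName, AccessType type) {
    this.fieldName = fieldName;
    this.type = type;
  }
}

// Empty stand-ins for the CDAP Dataset type and one concrete dataset.
interface Dataset { }

final class KeyValueDataset implements Dataset { }

// Hypothetical context exposing only one of the proposed getDataset overloads.
interface LineageAwareDatasetContext {
  <T extends Dataset> T getDataset(String name, Set<FieldMetadata> fieldMetadata);
}

final class InMemoryDatasetContext implements LineageAwareDatasetContext {
  private final Map<String, Dataset> datasets = new HashMap<>();
  private final List<String> lineage = new ArrayList<>();

  void addDataset(String name, Dataset dataset) {
    datasets.put(name, dataset);
  }

  @Override
  @SuppressWarnings("unchecked")
  public <T extends Dataset> T getDataset(String name, Set<FieldMetadata> fieldMetadata) {
    Dataset dataset = datasets.get(name);
    if (dataset == null) {
      // Stand-in for DatasetInstantiationException.
      throw new IllegalArgumentException("Unknown dataset: " + name);
    }
    // Lineage is recorded as a side effect of obtaining the dataset.
    for (FieldMetadata metadata : fieldMetadata) {
      lineage.add(name + "." + metadata.fieldName + ":" + metadata.type);
    }
    return (T) dataset;
  }

  List<String> getLineage() {
    return new ArrayList<>(lineage);
  }
}

class GetDatasetExample {
  static List<String> run() {
    InMemoryDatasetContext context = new InMemoryDatasetContext();
    context.addDataset("PatientRecords", new KeyValueDataset());
    // One call both returns the dataset and declares the field access.
    KeyValueDataset records = context.getDataset(
      "PatientRecords",
      Collections.singleton(new FieldMetadata("patient_medical_score", AccessType.READ)));
    return context.getLineage();
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

Here the caller makes a single call, which matches the Pros above, but each existing getDataset signature needs a parallel overload, which is the growth in API surface noted under Cons.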
Approach
Approach #1
Approach #2
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success; 404 - When application is not available; 500 - Any internal errors | |
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What is the impact on authorization, and how does the design take care of this aspect?
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of failures in downstream components such as YARN and HBase) and how the design takes care of these aspects.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3