Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
CDAP currently captures lineage at the dataset level. Dataset-level lineage tells users which programs read from or wrote to a dataset in a given timeframe, and lets them keep drilling in either the upstream or the downstream direction.
However, as a platform, CDAP understands schemas for most datasets, and schemas contain fields. It would be useful to be able to drill into how a field in a particular dataset was used (CREATE/READ/WRITE/DELETE) in a given time period.
Goals
- Provide CDAP platform support (in the form of API and storage) to track field level lineage.
- Pipelines can then expose this functionality to the plugins.
- Plugins (such as wrangler) will need to be updated to use this feature.
User Stories
Id | User Story |
---|---|
FLL-1 | As a data governance reviewer or information architect at a financial institution, I would like to generate a report of how a PII field UID from the dataset DailyTransactions was consumed in the specified time period so that |
FLL-2 | As a data scientist at a healthcare organization, I would like to trace the provenance of the field patient_medical_score in the dataset PatientRecords over the last month so that |
FLL-3 | As the developer of a plugin that defines some transformations on the source in a pipeline, I would like to be able to register that I performed a particular kind of operation (READ/WRITE) on an input field which generated an output field at a given instant so that |
Design
API to build the operations to be tracked at the field level.
    import java.util.Map;
    import java.util.Set;

    /**
     * The following enum is already available in CDAP and can be made part of cdap-api.
     * This type can be used to track the field level lineage as well.
     */
    public enum AccessType {
      READ('r'),
      WRITE('w'),
      READ_WRITE('a'),
      UNKNOWN('u');

      // Single-character code for the access type
      private final char type;

      AccessType(char type) {
        this.type = type;
      }
    }

    /**
     * The following class can be used in a program to provide the field level business metadata,
     * which includes tags, properties, and lineage information.
     */
    public class FieldMetadata {
      private final Field field;                     // Represents the field with associated schema
      private final AccessType type;                 // Access to be performed on the field, used for tracking the lineage
      private final Set<String> tags;                // Tags associated with the field
      private final Map<String, String> properties;  // Additional metadata properties associated with the field

      public FieldMetadata(Field field, AccessType type, Set<String> tags, Map<String, String> properties) {
        this.field = field;
        this.type = type;
        this.tags = tags;
        this.properties = properties;
      }
    }
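To make the shape of a FieldMetadata value concrete, here is a minimal, self-contained sketch of how a program might describe a READ of the UID field from user story FLL-1. It is an illustration under assumptions, not the proposed implementation: the `Field` class is a hypothetical stand-in that models only a field name, the constructor and getters on `FieldMetadata` are assumed, and the `pii` tag and `owner` property are made-up sample values.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FieldMetadataSketch {

  // Mirrors the AccessType enum from the design, with the char code stored explicitly.
  enum AccessType {
    READ('r'), WRITE('w'), READ_WRITE('a'), UNKNOWN('u');

    private final char code;

    AccessType(char code) {
      this.code = code;
    }

    char getCode() {
      return code;
    }
  }

  // Hypothetical stand-in for the schema field type; only the field name is modeled.
  static final class Field {
    private final String name;

    Field(String name) {
      this.name = name;
    }

    String getName() {
      return name;
    }
  }

  // FieldMetadata as in the design; the constructor and getters are assumptions.
  static final class FieldMetadata {
    private final Field field;
    private final AccessType type;
    private final Set<String> tags;
    private final Map<String, String> properties;

    FieldMetadata(Field field, AccessType type, Set<String> tags, Map<String, String> properties) {
      this.field = field;
      this.type = type;
      // Defensive copies so later changes to the caller's collections do not leak in.
      this.tags = Collections.unmodifiableSet(new HashSet<>(tags));
      this.properties = Collections.unmodifiableMap(new HashMap<>(properties));
    }

    Field getField() { return field; }
    AccessType getType() { return type; }
    Set<String> getTags() { return tags; }
    Map<String, String> getProperties() { return properties; }
  }

  // Describes a READ of the field UID; the tag and property values are sample data.
  static FieldMetadata describeUidRead() {
    Set<String> tags = new HashSet<>();
    tags.add("pii");
    Map<String, String> properties = new HashMap<>();
    properties.put("owner", "governance-team");
    return new FieldMetadata(new Field("UID"), AccessType.READ, tags, properties);
  }

  public static void main(String[] args) {
    FieldMetadata uid = describeUidRead();
    System.out.println(uid.getField().getName() + " " + uid.getType() + " " + uid.getTags());
  }
}
```

Copying the tags and properties in the constructor is a design choice of this sketch only; the design above does not say whether FieldMetadata should be immutable.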
Option #1: Add a new interface that allows programs to record field level lineage.
    /**
     * This interface provides methods that will allow programs to record the metadata at the field level.
     */
    public interface FieldMetadataRecorder {

      /**
       * Record the field level metadata for the given dataset.
       *
       * @param datasetName The name of the Dataset
       * @param fieldMetadata The set of field level metadata information
       */
      void record(String datasetName, Set<FieldMetadata> fieldMetadata);

      /**
       * Record the field level metadata for the given dataset in a given namespace.
       *
       * @param namespace The name of the namespace
       * @param datasetName The name of the Dataset
       * @param fieldMetadata The set of field level metadata information
       */
      void record(String namespace, String datasetName, Set<FieldMetadata> fieldMetadata);
    }
The DatasetContext interface can then extend the FieldMetadataRecorder interface so that programs can record field level metadata through the context they already use.
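As a rough illustration of Option #1, the sketch below implements the two `record` methods with an in-memory store and shows the extra call a program would make. Everything beyond the `FieldMetadataRecorder` method signatures is an assumption made for the example: `InMemoryRecorder`, its `recordsFor` helper, the `namespace.dataset` key format, and the simplified `FieldMetadata` (field name and access type only) are hypothetical and do not describe CDAP's actual storage.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RecorderSketch {

  enum AccessType { READ, WRITE, READ_WRITE, UNKNOWN }

  // Simplified stand-in for FieldMetadata: field name plus access type only.
  static final class FieldMetadata {
    final String fieldName;
    final AccessType type;

    FieldMetadata(String fieldName, AccessType type) {
      this.fieldName = fieldName;
      this.type = type;
    }
  }

  // Same two methods as the interface proposed in Option #1.
  interface FieldMetadataRecorder {
    void record(String datasetName, Set<FieldMetadata> fieldMetadata);

    void record(String namespace, String datasetName, Set<FieldMetadata> fieldMetadata);
  }

  // Hypothetical in-memory recorder; entries are keyed by "namespace.dataset".
  static final class InMemoryRecorder implements FieldMetadataRecorder {
    private final String defaultNamespace;
    private final Map<String, List<FieldMetadata>> records = new HashMap<>();

    InMemoryRecorder(String defaultNamespace) {
      this.defaultNamespace = defaultNamespace;
    }

    @Override
    public void record(String datasetName, Set<FieldMetadata> fieldMetadata) {
      // The single-namespace variant delegates to the program's own namespace.
      record(defaultNamespace, datasetName, fieldMetadata);
    }

    @Override
    public void record(String namespace, String datasetName, Set<FieldMetadata> fieldMetadata) {
      records.computeIfAbsent(namespace + "." + datasetName, k -> new ArrayList<>())
          .addAll(fieldMetadata);
    }

    // Helper for inspection in this sketch; not part of the proposed interface.
    List<FieldMetadata> recordsFor(String namespace, String datasetName) {
      return records.getOrDefault(namespace + "." + datasetName, Collections.emptyList());
    }
  }

  public static void main(String[] args) {
    InMemoryRecorder recorder = new InMemoryRecorder("default");

    Set<FieldMetadata> accessed = new HashSet<>();
    accessed.add(new FieldMetadata("UID", AccessType.READ));
    accessed.add(new FieldMetadata("amount", AccessType.WRITE));

    // The explicit call that Option #1 requires from the program.
    recorder.record("DailyTransactions", accessed);

    System.out.println(recorder.recordsFor("default", "DailyTransactions").size());
  }
}
```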
Pros:
- Keeps DatasetContext simpler: it only adds two methods for recording metadata.
Cons:
- The user must make additional method calls to record the field level metadata information.
Option #2: Add new versions of the getDataset methods that allow users to specify the set of field metadata information.
    /**
     * Get an instance of the specified Dataset.
     *
     * @param name The name of the Dataset
     * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
     * @param <T> The type of the Dataset
     * @return An instance of the specified Dataset, never null.
     * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
     *         cannot be loaded; the default constructor throws an exception; or the Dataset
     *         cannot be opened (for example, one of the underlying tables in the DataFabric
     *         cannot be accessed).
     */
    <T extends Dataset> T getDataset(String name, Set<FieldMetadata> fieldMetadata)
      throws DatasetInstantiationException;

    /**
     * Get an instance of the specified Dataset.
     *
     * @param namespace The namespace of the Dataset
     * @param name The name of the Dataset
     * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
     * @param <T> The type of the Dataset
     * @return An instance of the specified Dataset, never null.
     * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
     *         cannot be loaded; the default constructor throws an exception; or the Dataset
     *         cannot be opened (for example, one of the underlying tables in the DataFabric
     *         cannot be accessed).
     */
    <T extends Dataset> T getDataset(String namespace, String name, Set<FieldMetadata> fieldMetadata)
      throws DatasetInstantiationException;

    /**
     * Get an instance of the specified Dataset.
     *
     * @param name The name of the Dataset
     * @param arguments the arguments for this dataset instance
     * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
     * @param <T> The type of the Dataset
     * @return An instance of the specified Dataset, never null.
     * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
     *         cannot be loaded; the default constructor throws an exception; or the Dataset
     *         cannot be opened (for example, one of the underlying tables in the DataFabric
     *         cannot be accessed).
     */
    <T extends Dataset> T getDataset(String name, Map<String, String> arguments, Set<FieldMetadata> fieldMetadata)
      throws DatasetInstantiationException;

    /**
     * Get an instance of the specified Dataset.
     *
     * @param namespace The namespace of Dataset
     * @param name The name of the Dataset
     * @param arguments the arguments for this dataset instance
     * @param fieldMetadata The field level metadata operations that are expected to be performed on the dataset
     * @param <T> The type of the Dataset
     * @return An instance of the specified Dataset, never null.
     * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
     *         cannot be loaded; the default constructor throws an exception; or the Dataset
     *         cannot be opened (for example, one of the underlying tables in the DataFabric
     *         cannot be accessed).
     */
    <T extends Dataset> T getDataset(String namespace, String name, Map<String, String> arguments,
                                     Set<FieldMetadata> fieldMetadata)
      throws DatasetInstantiationException;
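For comparison with Option #1, the following sketch shows how the first of these overloads could record lineage as a side effect of obtaining the dataset. It is only a sketch under assumptions: `SketchDatasetContext`, `register`, `lineageFor`, and `InMemoryTable` are hypothetical names, `Dataset` and `DatasetInstantiationException` are local stand-ins rather than the CDAP types, and `FieldMetadata` is reduced to a field name and an access type.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GetDatasetSketch {

  enum AccessType { READ, WRITE, READ_WRITE, UNKNOWN }

  // Simplified stand-in for FieldMetadata: field name plus access type only.
  static final class FieldMetadata {
    final String fieldName;
    final AccessType type;

    FieldMetadata(String fieldName, AccessType type) {
      this.fieldName = fieldName;
      this.type = type;
    }
  }

  // Local stand-ins for the CDAP Dataset type and its instantiation exception.
  interface Dataset { }

  static final class DatasetInstantiationException extends RuntimeException {
    DatasetInstantiationException(String message) {
      super(message);
    }
  }

  // Hypothetical dataset used only to have something to return.
  static final class InMemoryTable implements Dataset { }

  // Hypothetical context: getDataset both returns the dataset and records the declared field usage.
  static final class SketchDatasetContext {
    private final Map<String, Dataset> datasets = new HashMap<>();
    private final Map<String, Set<FieldMetadata>> lineage = new HashMap<>();

    void register(String name, Dataset dataset) {
      datasets.put(name, dataset);
    }

    @SuppressWarnings("unchecked")
    <T extends Dataset> T getDataset(String name, Set<FieldMetadata> fieldMetadata)
        throws DatasetInstantiationException {
      Dataset dataset = datasets.get(name);
      if (dataset == null) {
        throw new DatasetInstantiationException("Dataset cannot be instantiated: " + name);
      }
      // No separate record(...) call: lineage is captured when the dataset is requested.
      lineage.computeIfAbsent(name, k -> new HashSet<>()).addAll(fieldMetadata);
      return (T) dataset;
    }

    // Helper for inspection in this sketch; not part of the proposed API.
    Set<FieldMetadata> lineageFor(String name) {
      return lineage.getOrDefault(name, Collections.emptySet());
    }
  }

  public static void main(String[] args) {
    SketchDatasetContext context = new SketchDatasetContext();
    context.register("PatientRecords", new InMemoryTable());

    Set<FieldMetadata> fields = new HashSet<>();
    fields.add(new FieldMetadata("patient_medical_score", AccessType.READ));

    InMemoryTable table = context.getDataset("PatientRecords", fields);
    System.out.println(table != null && context.lineageFor("PatientRecords").size() == 1);
  }
}
```

In this sketch the field metadata is recorded only after the dataset lookup succeeds, so a failed `getDataset` call leaves no lineage entry; whether the real implementation should behave that way is left open by the design.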
Pros:
- No need for separate methods to record the lineage. With an additional parameter, the getDataset method itself can be used to record the lineage information.
Cons:
- DatasetContext already has many versions of the getDataset method that accept different parameters. Adding more methods with the new parameter can be confusing to the user.
Approach
Approach #1
Approach #2
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success; 404 - When application is not available; 500 - Any internal errors | |
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What is the impact on authorization, and how does the design take care of this aspect?
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of downstream component failures such as YARN or HBase) and how the design takes care of these aspects.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3