Table of Contents |
---|
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
CDAP currently captures lineage at the dataset level. With lineage, users can tell the program that read from or wrote to a dataset. It can help users determine which program wrote to/read from a dataset in a given timeframe. They can keep drilling into either the upstream or the downstream direction.
However, as a platform, CDAP understands schemas for most datasets. Schemas contain fields. It would be useful to be able to drill into how a field in a particular dataset was used(CREATE/READ/WRITE/DELETE) in a given time period.
Goals
- Provide CDAP platform support (in the form of API and storage) to track field level lineage.
- Pipelines can then expose this functionality to the plugins.
- Plugins (such as wrangler) will need to be updated to use this feature.
User Stories
- Breakdown of User-Stories
- User Story #1
- User Story #2
- User Story #3
Design
API to build the operations to be tracked at the field level.
Code Block | ||
---|---|---|
| ||
/** * Enum that defines the type of operations that can be tracked for the lineage */ public enum OperationType { CREATE, // New field is created READ, // Field is read UPDATE, // Field value is updated DELETE, // Field is deleted from the dataset RENAME // Field is renamed into some other field } public class FieldOperationsFieldOperation { private final Field field; // Represents the field with associated schema private final OperationType type; // Represents operation to be performed on the field @Nullable private final Field renamedField; // If the operation type is RENAME which means the program is renaming the field } |
Option 1: Provide an API in the DatasetContext interface which will allow users to specify the set of field mutationsoperations.
Code Block | ||
---|---|---|
| ||
/** * Get an instance of the specified Dataset. * * @param name The name of the Dataset * @param mutationsfieldOperations The field mutationsoperations that are expected to be performed on thisthe dataset * @param <T> The type of the Dataset * @return An instance of the specified Dataset, never null. * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class * cannot be loaded; the default constructor throws an exception; or the Dataset * cannot be opened (for example, one of the underlying tables in the DataFabric * cannot be accessed). */ <T extends Dataset> T getDataset(String name, FieldMutationsSet<FieldOperation> mutationsfieldOperations) throws DatasetInstantiationException; /** * Get an instance of the specified Dataset. * * @param namespace The namespace of the Dataset * @param name The name of the Dataset * @param mutationsfieldOperations The field mutationsoperations that are expected to be performed on thisthe dataset * @param <T> The type of the Dataset * @return An instance of the specified Dataset, never null. * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class * cannot be loaded; the default constructor throws an exception; or the Dataset * cannot be opened (for example, one of the underlying tables in the DataFabric * cannot be accessed). */ <T extends Dataset> T getDataset(String namespace, String name, FieldMutationsSet<FieldOperation> mutationsfieldOperations) throws DatasetInstantiationException; /** * Get an instance of the specified Dataset. * * @param name The name of the Dataset * @param arguments the arguments for this dataset instance * @param mutationsfieldOperations The field mutationsoperations that are expected to be performed on thisthe dataset * @param <T> The type of the Dataset * @return An instance of the specified Dataset, never null. * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class * cannot be loaded; the default constructor throws an exception; or the Dataset * cannot be opened (for example, one of the underlying tables in the DataFabric * cannot be accessed). */ <T extends Dataset> T getDataset(String name, Map<String, String> arguments, FieldMutationsSet<FieldOperation> mutationsfieldOperations) throws DatasetInstantiationException; /** * Get an instance of the specified Dataset. * * @param namespace The namespace of Dataset * @param name The name of the Dataset * @param arguments the arguments for this dataset instance * @param mutationsfieldOperations The field mutationsoperations that are expected to be performed on thisthe dataset * @param <T> The type of the Dataset * @return An instance of the specified Dataset, never null. * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class * cannot be loaded; the default constructor throws an exception; or the Dataset * cannot be opened (for example, one of the underlying tables in the DataFabric * cannot be accessed). */ <T extends Dataset> T getDataset(String namespace, String name, Map<String, String> arguments, FieldMutationsSet<FieldOperation> mutationsfieldOperations) throws DatasetInstantiationException; |
Approach
Approach #1
Approach #2
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success 404 - When application is not available 500 - Any internal errors |
|
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What's the impact on Authorization and how does the design take care of this aspect
Impact on Infrastructure Outages
System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3