Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction 

CDAP currently captures lineage at the dataset level. With lineage, users can determine which programs read from or wrote to a dataset in a given time frame, and they can keep drilling in either the upstream or the downstream direction.

However, as a platform, CDAP understands the schemas of most datasets, and schemas contain fields. It would be useful to drill into how a field in a particular dataset was used (CREATE/READ/WRITE/DELETE) in a given time period.

Goals

  • Provide CDAP platform support (in the form of an API and storage) to track field-level lineage.
  • Pipelines can then expose this functionality to plugins.
  • Plugins (such as Wrangler) will need to be updated to use this feature.

User Stories 

  • Breakdown of User-Stories 
  • User Story #1
  • User Story #2
  • User Story #3

Design

API to build the operations to be tracked at the field level.

Code Block
languagejava
import javax.annotation.Nullable;

/**
 * Enum that defines the types of operations that can be tracked for lineage.
 */
public enum OperationType {
  CREATE, // New field is created
  READ,   // Field is read
  UPDATE, // Field value is updated
  DELETE, // Field is deleted from the dataset
  RENAME  // Field is renamed to another field
}

/**
 * Represents a single operation performed on a field.
 */
public class FieldOperations {
  private final Field field;        // The field, with its associated schema
  private final OperationType type; // The operation performed on the field
  @Nullable
  private final Field renamedField; // The new field; set only when the operation type is RENAME

  public FieldOperations(Field field, OperationType type, @Nullable Field renamedField) {
    this.field = field;
    this.type = type;
    this.renamedField = renamedField;
  }
}
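
As a minimal sketch of how a program might describe its field-level operations with these types, the example below records a read, a rename, and a delete. The `Field` stand-in, the constructor check, the `describe` helper, and the field names are assumptions added for illustration; the design above only names the members of `FieldOperations`.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the schema field type; the real type is not specified in this design.
final class Field {
  private final String name;

  Field(String name) {
    this.name = name;
  }

  String getName() {
    return name;
  }
}

enum OperationType { CREATE, READ, UPDATE, DELETE, RENAME }

// Mirrors FieldOperations from the design; constructor and helper are assumed.
final class FieldOperations {
  private final Field field;
  private final OperationType type;
  private final Field renamedField; // null unless type is RENAME

  FieldOperations(Field field, OperationType type, Field renamedField) {
    // Assumed invariant: a target field is given if and only if the type is RENAME.
    if ((type == OperationType.RENAME) != (renamedField != null)) {
      throw new IllegalArgumentException("renamedField must be set only for RENAME");
    }
    this.field = field;
    this.type = type;
    this.renamedField = renamedField;
  }

  // Human-readable form, e.g. for logging what a program declared.
  String describe() {
    if (type == OperationType.RENAME) {
      return type + " " + field.getName() + " -> " + renamedField.getName();
    }
    return type + " " + field.getName();
  }
}

final class FieldOperationsExample {
  // Operations that a hypothetical Wrangler-like plugin might declare.
  static List<FieldOperations> sampleOperations() {
    List<FieldOperations> ops = new ArrayList<>();
    ops.add(new FieldOperations(new Field("body"), OperationType.READ, null));
    ops.add(new FieldOperations(new Field("fname"), OperationType.RENAME, new Field("first_name")));
    ops.add(new FieldOperations(new Field("ssn"), OperationType.DELETE, null));
    return ops;
  }

  public static void main(String[] args) {
    for (FieldOperations op : sampleOperations()) {
      System.out.println(op.describe());
    }
  }
}
```

Rejecting a RENAME without a target field at construction time is one possible design choice; the actual API may validate this elsewhere or not at all.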

 

Option 1: Provide an API in the DatasetContext interface which will allow users to specify the set of field mutations.

 

Code Block
languagejava
/**
 * Get an instance of the specified Dataset.
 *
 * @param name The name of the Dataset
 * @param mutations The field mutations that are expected on this dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String name, FieldMutations mutations) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param namespace The namespace of the Dataset
 * @param name The name of the Dataset
 * @param mutations The field mutations that are expected on this dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String namespace, String name, FieldMutations mutations) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param name The name of the Dataset
 * @param arguments The arguments for this dataset instance
 * @param mutations The field mutations that are expected on this dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String name, Map<String, String> arguments, FieldMutations mutations) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param namespace The namespace of the Dataset
 * @param name The name of the Dataset
 * @param arguments The arguments for this dataset instance
 * @param mutations The field mutations that are expected on this dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String namespace, String name, Map<String, String> arguments, FieldMutations mutations)
  throws DatasetInstantiationException;
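
To illustrate how a caller might use Option 1, the sketch below declares the expected field mutations while obtaining a dataset. It is not the CDAP implementation: `Dataset`, `DatasetContext`, and `DatasetInstantiationException` are local stand-ins so the example compiles on its own, and the shape of `FieldMutations` (a map from field name to operation names), `RecordingDatasetContext`, `PurchaseTable`, and the dataset and field names are all hypothetical, since this design does not define them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local stand-ins for the CDAP types referenced by the proposed API.
interface Dataset { }

final class DatasetInstantiationException extends RuntimeException {
  DatasetInstantiationException(String message) {
    super(message);
  }
}

// Hypothetical shape of FieldMutations: field name -> declared operation names.
final class FieldMutations {
  private final Map<String, List<String>> operationsByField = new LinkedHashMap<>();

  FieldMutations add(String fieldName, String operation) {
    operationsByField.computeIfAbsent(fieldName, k -> new ArrayList<>()).add(operation);
    return this;
  }

  Map<String, List<String>> asMap() {
    return operationsByField;
  }
}

// Only the first overload proposed in Option 1 is reproduced here.
interface DatasetContext {
  <T extends Dataset> T getDataset(String name, FieldMutations mutations)
    throws DatasetInstantiationException;
}

// Toy context that records the declared mutations per dataset, which is the
// point where a real implementation might emit field-level lineage.
final class RecordingDatasetContext implements DatasetContext {
  private final Map<String, Dataset> datasets = new HashMap<>();
  final Map<String, FieldMutations> recorded = new HashMap<>();

  void register(String name, Dataset dataset) {
    datasets.put(name, dataset);
  }

  @Override
  @SuppressWarnings("unchecked")
  public <T extends Dataset> T getDataset(String name, FieldMutations mutations) {
    Dataset dataset = datasets.get(name);
    if (dataset == null) {
      throw new DatasetInstantiationException("Unknown dataset: " + name);
    }
    recorded.put(name, mutations);
    return (T) dataset;
  }
}

final class PurchaseTable implements Dataset { }

final class GetDatasetExample {
  static RecordingDatasetContext run() {
    RecordingDatasetContext context = new RecordingDatasetContext();
    context.register("purchases", new PurchaseTable());

    // Declare up front which fields the program expects to touch.
    FieldMutations mutations = new FieldMutations()
      .add("customer_id", "READ")
      .add("total", "UPDATE");
    PurchaseTable table = context.getDataset("purchases", mutations);
    return context;
  }

  public static void main(String[] args) {
    System.out.println(run().recorded.get("purchases").asMap());
  }
}
```

With this style of API the mutations are declared when the dataset is obtained, so lineage reflects what the program says it will do rather than what it is observed doing.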


 

 

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path | Method | Description | Response Code | Response
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success; 404 - When application is not available; 500 - Any internal errors |


Deprecated REST API

Path | Method | Description
/v3/apps/<app-id> | GET | Returns the application spec for a given application

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What is the impact on authorization, and how does the design take care of this aspect?

Impact on Infrastructure Outages 

System behavior (if applicable, document the impact of failures in downstream components such as YARN and HBase) and how the design takes care of these aspects.

Test Scenarios

Test ID | Test Description | Expected Results

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3

 

Future work