Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 5 Next »

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

CDAP currently captures lineage at the dataset level. With lineage, users can tell the program that read from or wrote to a dataset. It can help users determine which program wrote to/read from a dataset in a given timeframe. They can keep drilling into either the upstream or the downstream direction.

However, as a platform, CDAP understands schemas for most datasets. Schemas contain fields. It would be useful to be able to drill into how a field in a particular dataset was used(CREATE/READ/WRITE/DELETE) in a given time period.

Goals

  • Provide CDAP platform support (in the form of API and storage) to track field level lineage.
  • Pipelines can then expose this functionality to the plugins.
  • Plugins (such as wrangler) will need to be updated to use this feature.

User Stories 

  • Breakdown of User-Stories 
  • User Story #1
  • User Story #2
  • User Story #3

Design

API to build the operations to be tracked at the field level.

/**
 * Enum that defines the type of operations that can be tracked for the lineage
 */
public enum OperationType {
	CREATE,  // New field is created
	READ, // Field is read
    UPDATE, // Field value is updated
    DELETE, // Field is deleted from the dataset
    RENAME // Field is renamed into some other field
}
 
public class FieldOperation {
	private final Field field; // Represents the field with associated schema
	private final OperationType type; // Represents operation to be performed on the field
	@Nullable
	private final Field renamedField; // If the operation type is RENAME which means the program is renaming the field 
}

Option #1: Provide an API in the DatasetContext interface which will allow users to specify the set of field operations.

/**
 * Get an instance of the specified Dataset.
 *
 * @param name The name of the Dataset
 * @param fieldOperations The field operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String name, Set<FieldOperation> fieldOperations) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param namespace The namespace of the Dataset
 * @param name The name of the Dataset
 * @param fieldOperations The field operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String namespace, String name, Set<FieldOperation> fieldOperations) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param name The name of the Dataset
 * @param arguments the arguments for this dataset instance
 * @param fieldOperations The field operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String name, Map<String, String> arguments, Set<FieldOperation> fieldOperations) throws DatasetInstantiationException;

/**
 * Get an instance of the specified Dataset.
 *
 * @param namespace The namespace of Dataset
 * @param name The name of the Dataset
 * @param arguments the arguments for this dataset instance
 * @param fieldOperations The field operations that are expected to be performed on the dataset
 * @param <T> The type of the Dataset
 * @return An instance of the specified Dataset, never null.
 * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
 *         cannot be loaded; the default constructor throws an exception; or the Dataset
 *         cannot be opened (for example, one of the underlying tables in the DataFabric
 *         cannot be accessed).
 */
<T extends Dataset> T getDataset(String namespace, String name, Map<String, String> arguments, Set<FieldOperation> fieldOperations)
  throws DatasetInstantiationException;


 

Option #2: Add new interface which will allow programs to record the field level lineage.

 

/**
 * This interface provides methods that will allow programs to record the lineage at field level
 */
public interface LineageRecorder {
    /**
 	 * Record the lineage for the given dataset.
 	 *
 	 * @param datasetName The name of the Dataset
 	 * @param operations The set of field level operations that will be performed on the dataset
 	*/
    void recordLineage(String datasetName, Set<FieldOperation> operations);
 
    /**
 	 * Record the lineage for the given dataset in a given namespace.
 	 *
 	 * @param namespace The name of the namespace
 	 * @param datasetName The name of the Dataset
 	 * @param operations The set of field level operations that will be performed on the dataset in the given namespace
 	*/
 	void recordLineage(String namespace, String datasetName, Set<FieldOperation> operations);
}

 

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

PathMethodDescriptionResponse CodeResponse
/v3/apps/<app-id>GETReturns the application spec for a given application

200 - On success

404 - When application is not available

500 - Any internal errors

 

     

Deprecated REST API

PathMethodDescription
/v3/apps/<app-id>GETReturns the application spec for a given application

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages 

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test IDTest DescriptionExpected Results
   
   
   
   

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3

 

Future work

  • No labels