Introduction
The following design documents:
- Metadata API to add metadata to elements inside CDAP that are not entities, for example files in a fileset or schema fields in a dataset.
- Plugin/Program API to add metadata to elements inside CDAP that are not entities, for example files in a fileset or schema fields in a dataset.
- Authorization for Metadata
Design
MetadataEntity Representation Format
One of the pain points of the existing metadata APIs is that they only allow metadata to be associated with CDAP entities. This is very restrictive for enterprises that want the capability to tag/discover any resource in CDAP that is not a CDAP entity (for example a field of a schema or a file of a fileset).
To solve this problem we need to support a generic way of specifying metadata entities (entities and non-entities) in CDAP. For this we propose the following generic way of specifying metadata entities for metadata annotations.
A map of string to string will allow users to specify any metadata entity in CDAP and will also support existing CDAP entities. For example:
- Existing CDAP entities like a dataset with datasetId myDataset can be specified as:
  Map<namespace=myNamespace, dataset=myDataset>
- Field 'empName' of dataset 'myDataset' can be specified as:
  Map<namespace=myNamespace, dataset=myDataset, field=empName>
- File 'part01' of a fileset 'myFileset' can be specified as:
  Map<namespace=myNamespace, dataset=myFileset, file=part01>
- The above free-form map allows us to represent any resource in the CDAP metadata system irrespective of whether it is present in CDAP or not. For example, an external MySQL table can be represented as:
  Map<database=myDatabase, table=myTable>
    @Beta
    public final class MetadataEntity {
      private final List<KeyValue<String, String>> details;

      private MetadataEntity(List<KeyValue<String, String>> details) {
        this.details = details;
      }

      public List<KeyValue<String, String>> getDetails() {
        return details;
      }

      public static Builder builder() {
        return new Builder();
      }

      public static class Builder {
        private final List<KeyValue<String, String>> details = new LinkedList<>();

        public Builder add(String k, String v) {
          details.add(new KeyValue<>(k, v));
          return this;
        }

        public Builder fromEntityId(EntityId entityId) {
          ...
          return this;
        }

        public MetadataEntity build() {
          return new MetadataEntity(Collections.unmodifiableList(details));
        }
      }

      public static MetadataEntity fromEntityId(EntityId entityId) {
        // converts an EntityId to a MetadataEntity
      }
    }
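To make the map examples above concrete, here is a minimal, self-contained sketch of the builder pattern. `MetadataEntitySketch` is a hypothetical stand-in, not the CDAP class, and it assumes an insertion-ordered map in place of the `List<KeyValue>` shown above so that the namespace, dataset, field hierarchy is preserved.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the proposed MetadataEntity (not the CDAP class).
final class MetadataEntitySketch {
  private final Map<String, String> details;

  private MetadataEntitySketch(Map<String, String> details) {
    this.details = details;
  }

  static Builder builder() {
    return new Builder();
  }

  Map<String, String> getDetails() {
    return details;
  }

  @Override
  public String toString() {
    return details.toString();
  }

  static final class Builder {
    // Assumption: insertion order is kept so the key sequence reads as a hierarchy.
    private final Map<String, String> details = new LinkedHashMap<>();

    Builder add(String key, String value) {
      details.put(key, value);
      return this;
    }

    MetadataEntitySketch build() {
      // Copy before wrapping so later builder calls cannot change a built entity.
      return new MetadataEntitySketch(
          Collections.unmodifiableMap(new LinkedHashMap<>(details)));
    }
  }

  public static void main(String[] args) {
    MetadataEntitySketch field = builder()
        .add("namespace", "myNamespace")
        .add("dataset", "myDataset")
        .add("field", "empName")
        .build();
    MetadataEntitySketch externalTable = builder()
        .add("database", "myDatabase")
        .add("table", "myTable")
        .build();
    System.out.println(field);
    System.out.println(externalTable);
  }
}
```

The same builder covers both a CDAP-scoped resource (the schema field) and an external one (the MySQL table), which is the point of the free-form representation.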
Overview of API changes
Our existing metadata APIs will need to change to allow users to specify the generic metadata target described above. All our existing metadata APIs are built around EntityId as the target for the metadata system. Since an EntityId is just a set of key-value pairs with well-defined key names, it can easily be represented as the MetadataEntity presented above.
For example, a MapReduce program with ProgramId 'myProgram' can be represented as a map with the following key-value pairs:
Map<namespace=myNamespace, application=myApplication, appVersion=1.0.0-SNAPSHOT, programName=myProgram, programType=mapreduce>
CDAP internal Metadata APIs will be changed to accept a MetadataEntity rather than EntityId. For example the following APIs in MetadataAdmin
    void addProperties(NamespacedEntityId namespacedEntityId, Map<String, String> properties)
      throws NotFoundException, InvalidMetadataException;

    void addTags(NamespacedEntityId namespacedEntityId, String... tags)
      throws NotFoundException, InvalidMetadataException;

    Set<MetadataRecord> getMetadata(NamespacedEntityId namespacedEntityId) throws NotFoundException;
will change to
    void addProperties(MetadataEntity entity, Map<String, String> properties)
      throws NotFoundException, InvalidMetadataException;

    void addTags(MetadataEntity entity, String... tags)
      throws NotFoundException, InvalidMetadataException;

    Set<MetadataRecord> getMetadata(MetadataEntity entity) throws NotFoundException;
In addition to the new metadata APIs, we will also introduce new utility methods and public APIs that allow users to add metadata by directly specifying an EntityId and/or to easily convert an EntityId to a MetadataEntity for the metadata system. This is shown in the MetadataEntity class documented earlier.
For backward compatibility we will deprecate all the APIs that work with EntityId and change their implementation to convert the EntityId to a MetadataEntity.
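The deprecation strategy can be sketched as a deprecated overload that converts its argument and delegates to the new method. Everything below is illustrative: `MetadataAdminSketch` and `DatasetIdSketch` are hypothetical stand-ins, a plain `Map<String, String>` plays the role of MetadataEntity, and the in-memory map replaces the real metadata store.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of keeping an EntityId-based API alive by delegation.
final class MetadataAdminSketch {

  // Hypothetical stand-in for a CDAP DatasetId.
  static final class DatasetIdSketch {
    final String namespace;
    final String dataset;

    DatasetIdSketch(String namespace, String dataset) {
      this.namespace = namespace;
      this.dataset = dataset;
    }

    // An entity id is just well-known keys, so the conversion is mechanical.
    Map<String, String> toMetadataEntity() {
      Map<String, String> entity = new LinkedHashMap<>();
      entity.put("namespace", namespace);
      entity.put("dataset", dataset);
      return Collections.unmodifiableMap(entity);
    }
  }

  // In-memory stand-in for the metadata store, keyed by the generic entity.
  private final Map<Map<String, String>, Set<String>> tagsByEntity = new HashMap<>();

  // New-style API: accepts the generic key-value representation.
  void addTags(Map<String, String> entity, String... tags) {
    tagsByEntity.computeIfAbsent(entity, key -> new TreeSet<>())
        .addAll(Arrays.asList(tags));
  }

  // Old-style API: deprecated, converts and delegates to the new method.
  @Deprecated
  void addTags(DatasetIdSketch datasetId, String... tags) {
    addTags(datasetId.toMetadataEntity(), tags);
  }

  Set<String> getTags(Map<String, String> entity) {
    return tagsByEntity.getOrDefault(entity, Collections.emptySet());
  }
}
```

Because both overloads end up writing under the same key, metadata added through the deprecated path is visible through the new one.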
Schema Fields as MetadataEntity
Allowing metadata to be associated with non-entities (MetadataEntity) will allow us to associate metadata with schema fields. Schema fields are an important kind of non-entity, and we need to discuss how we will associate metadata with schema fields, retrieve it and display it in the UI.
Specifying Schema Field as Resource:
    DatasetId myDataset = context.getDataset(....);
    // fromEntityId is an instance method of the builder, so obtain a builder first
    MetadataEntity.Builder builder = MetadataEntity.builder().fromEntityId(myDataset);
    builder.add("field", "EmployeeSSN");
    MetadataEntity employeeSSNField = builder.build();
    metadataClient.addTags(employeeSSNField, "PII");
Retrieving Schema Fields through Metadata Search:
When a user performs a search that matches metadata associated with a schema field, ideally we should display the schema field. Our current UI does not support displaying non-entities, so we will display the dataset associated with the field instead.
Program/Plugin Level APIs
    @Override
    public void initialize() throws Exception {
      MapReduceContext context = getContext();
      MetadataContext metadataContext = context.getMetadataContext();
      metadataContext.addTags(resource, tags...);
    }
The MetadataContext that a developer obtains here will be a RemoteMetadataClient, which discovers the MetadataService through service discovery.
Plugin APIs
We will extend the Lineage APIs to support metadata.
We will introduce a new interface called PluginMetadataWriter:
    /**
     * Metadata Writer APIs for Plugins
     */
    public interface PluginMetadataWriter {
      void addProperties(Map<String, String> properties);

      void addTags(String... tags);

      void addTags(Iterable<String> tags);

      void removeAllMetadata();

      void removeProperties();

      void removeProperties(String... keys);

      void removeTags();

      void removeTags(String... tags);
    }
Destination from Lineage APIs will become
    /**
     * Destination represents the dataset of which the fields will be part of.
     */
    public class Destination implements PluginMetadataWriter {
      // Namespace associated with the Dataset.
      // This is required since programs can read the data from different namespace.
      String namespace;

      // Name of the Dataset
      String name;

      // Description associated with the Dataset.
      String description;

      // Properties associated with the Dataset.
      // This can potentially store plugin properties of the Sink for context.
      // For example in case of KafkaProducer sink, properties can include broker id, list of topics etc.
      Map<String, String> properties;

      // Metadata Information
      // Metadata information associated with Destination Dataset (the metadata for the dataset only; individual
      // schema field metadata should be recorded as part of FieldOperation).
      Map<String, String> metadataProperties;
      Set<String> tags;

      @Override
      public void addProperties(Map<String, String> properties) {
        // adds to metadataProperties
      }

      @Override
      public void addTags(String... tags) {
        // adds to tags
      }

      @Override
      public void addTags(Iterable<String> tags) {
        // adds to tags
      }

      @Override
      public void removeAllMetadata() {
        // remove all metadata (properties and tags)
      }

      @Override
      public void removeProperties() {
        // clears metadataProperties
      }

      @Override
      public void removeProperties(String... keys) {
        // clears the given keys from metadataProperties
      }

      @Override
      public void removeTags() {
        // removes all tags
      }

      @Override
      public void removeTags(String... tags) {
        // removes the specified tags from tags
      }
    }
    public class Input implements PluginMetadataWriter {
      // Schema field which is input to the operation
      Schema.Field field;

      // Source information if the field belongs to the source/dataset
      @Nullable
      Source source;

      // Create input from a Field. Since Schema can be nested, plain String cannot be
      // used to uniquely identify the Field in the Schema, as multiple Fields can have same name
      // but different nesting. In order to uniquely identify a Field from the Schema we will
      // need an enhancement in the CDAP platform so that each Field can hold the transient
      // reference to the parent Field. From these references, then we can create unique field path.
      public static Input ofField(Schema.Field field) { }

      // Create input from the Field which belongs to the Source
      public static Input ofField(Source source, Schema.Field field) { }

      // Metadata information associated with schema Field
      Map<String, String> metadataProperties;
      Set<String> tags;

      @Override
      public void addProperties(Map<String, String> properties) {
        // adds to metadataProperties
      }

      @Override
      public void addTags(String... tags) {
        // adds to tags
      }

      @Override
      public void addTags(Iterable<String> tags) {
        // adds to tags
      }

      @Override
      public void removeAllMetadata() {
        // remove all metadata (properties and tags)
      }

      @Override
      public void removeProperties() {
        // clears metadataProperties
      }

      @Override
      public void removeProperties(String... keys) {
        // clears the given keys from metadataProperties
      }

      @Override
      public void removeTags() {
        // removes all tags
      }

      @Override
      public void removeTags(String... tags) {
        // removes the specified tags from tags
      }
    }
We can modify the LineageRecorder interface so that its implementation records both metadata and lineage.
Authorization for Metadata
Allowing metadata to be added to a CDAP MetadataEntity that is not an entity raises the question of authorization enforcement (i.e. who can add metadata to these resources). Since these resources are not entities, we cannot have policies defined for them as of now.
Even though CDAP MetadataEntities are not predefined, we can rely on the fact that these resources are generally under some CDAP EntityId. For example, schema fields are always associated with a dataset, and a file in a fileset is always associated with the dataset itself. If such a relationship does not exist, we can rely on the fact that the resource exists under a namespace and perform authorization on that namespace. In the case of external resources that do not even exist under a namespace, we can enforce on the InstanceId if needed.
Since metadata always belongs to some MetadataEntity (EntityId or non-entity) in CDAP, the enforcement will be done on an EntityId (see above for how we determine the entity to enforce on when the MetadataEntity is not an EntityId).
Operation | Privilege Required |
---|---|
Get Metadata (Property/Tag) | READ on the Entity with which the metadata is associated |
Add Metadata (Property/Tag) | WRITE on the Entity to which the metadata is being added |
Remove Metadata (Property/Tag) | WRITE/ADMIN on the Entity from which the metadata is being removed |
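The enforcement rules above can be sketched as two small functions: one that picks the entity to enforce on, and one that maps an operation to the privileges from the table. This is a simplified sketch under stated assumptions: only datasets are recognized as parent entities, the instance name "cdap" is a placeholder, and "WRITE/ADMIN" is read as "either privilege is sufficient". All class and method names are hypothetical.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of authorization for metadata on non-entities.
final class MetadataAuthorizationSketch {
  enum Operation { GET, ADD, REMOVE }

  enum Privilege { READ, WRITE, ADMIN }

  // Privileges accepted for each operation, following the table above.
  // Assumption: for removal, either WRITE or ADMIN is sufficient.
  static Set<Privilege> acceptedPrivileges(Operation operation) {
    switch (operation) {
      case GET:
        return EnumSet.of(Privilege.READ);
      case ADD:
        return EnumSet.of(Privilege.WRITE);
      case REMOVE:
        return EnumSet.of(Privilege.WRITE, Privilege.ADMIN);
      default:
        throw new IllegalArgumentException("Unknown operation: " + operation);
    }
  }

  // Picks the entity to enforce on: the enclosing dataset if there is one,
  // otherwise the namespace, otherwise the instance.
  static Map<String, String> enforcementTarget(Map<String, String> entity) {
    Map<String, String> target = new LinkedHashMap<>();
    String namespace = entity.get("namespace");
    if (namespace == null) {
      // External resource outside any namespace; "cdap" is a placeholder name.
      target.put("instance", "cdap");
      return target;
    }
    target.put("namespace", namespace);
    String dataset = entity.get("dataset");
    if (dataset != null) {
      target.put("dataset", dataset);
    }
    return target;
  }
}
```

With this sketch, tagging field `empName` of `myDataset` would be checked against the dataset, while tagging an external MySQL table would fall through to the instance.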
Metadata in Transaction
CDAP 5.0 introduces the capability to add metadata from program/pipeline runs. This raises the question of what happens to metadata added in a pipeline run that fails. Metadata added by failed pipeline runs will be retained. Since, as of now, we expect metadata to be added to schema fields rather than to the data records themselves (which are written in a pipeline run that might fail, leading to no data being written), we can treat the two as unrelated. However, conditional metadata annotations will not give the expected results. For example, consider tagging a schema field with "high" only if any entry written to the field has a value greater than 100. With a failed pipeline run, we will end up with the schema field tagged "high" even though none of the stored data records has a value greater than 100.
Metadata for Versions
Our initial thought was to keep metadata independent of the application/artifact version, consistent with the authorization policy, but there are use cases that this model does not serve well. For example, an enterprise has two versions of various applications deployed in its CDAP instance: one version is in production and another is under active development. In such a scenario a user might want to tag all development versions of the applications with, say, "dev" and all production versions with "prod", so that they can later be discovered through search. Making metadata version-independent will not work for this use case. So we will support two ways of adding metadata to a versioned application/artifact:
- If no version is specified while adding metadata to an application/artifact, the metadata will be added to all existing versions.
- If a version is specified while adding metadata to an application/artifact, the metadata will be added only to the specified version.
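The two rules can be illustrated with a small in-memory sketch. `VersionedMetadataSketch` and its methods are hypothetical; a `null` version stands for "no version specified", and the nested maps stand in for the real metadata store.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of version-aware tagging for applications/artifacts.
final class VersionedMetadataSketch {
  // application name -> version -> tags
  private final Map<String, Map<String, Set<String>>> tags = new TreeMap<>();

  // Registers a deployed version so it can receive metadata.
  void deploy(String application, String version) {
    tags.computeIfAbsent(application, a -> new TreeMap<>())
        .computeIfAbsent(version, v -> new TreeSet<>());
  }

  // version == null means "no version specified": tag every existing version.
  void addTag(String application, String version, String tag) {
    Map<String, Set<String>> versions =
        tags.getOrDefault(application, Collections.emptyMap());
    if (version == null) {
      for (Set<String> versionTags : versions.values()) {
        versionTags.add(tag);
      }
      return;
    }
    Set<String> versionTags = versions.get(version);
    if (versionTags == null) {
      throw new IllegalArgumentException(
          "Unknown version " + version + " of " + application);
    }
    versionTags.add(tag);
  }

  Set<String> getTags(String application, String version) {
    return tags.getOrDefault(application, Collections.emptyMap())
        .getOrDefault(version, Collections.emptySet());
  }
}
```

Note that in this sketch a version deployed after an unversioned `addTag` call does not inherit the tag, which matches the "all the existing versions" wording of the first rule.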
Special Character Support for Metadata
Currently, our metadata system only allows the characters a-z, 0-9 and -. This puts a severe restriction on what users can store in our metadata system. For example, if someone wants to add a metadata value that contains commas, the current system will not allow it. We will extend our current metadata system to allow common special characters to be stored. This will require changes to the characters we use as separators while storing metadata.
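The following sketch shows the restriction as stated above and one possible way to lift it. The pattern mirrors the character set named in the text; the comma separator and backslash escaping are purely illustrative assumptions, since this document does not specify which separators the store uses or how they will change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: the current character rule and an escaping scheme
// that would let separator characters appear inside stored values.
final class MetadataCharacterSketch {
  // Restriction as described in the text: a-z, 0-9 and '-' only.
  static final Pattern CURRENT_RULE = Pattern.compile("[a-z0-9-]+");

  static boolean isAllowedToday(String value) {
    return CURRENT_RULE.matcher(value).matches();
  }

  // Assumption: values are stored joined by ','; escape '\' first, then ','.
  static String escape(String value) {
    return value.replace("\\", "\\\\").replace(",", "\\,");
  }

  static String join(List<String> values) {
    List<String> escaped = new ArrayList<>();
    for (String value : values) {
      escaped.add(escape(value));
    }
    return String.join(",", escaped);
  }

  // Splits on unescaped commas and removes the escaping again.
  static List<String> split(String stored) {
    List<String> values = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < stored.length(); i++) {
      char c = stored.charAt(i);
      if (c == '\\' && i + 1 < stored.length()) {
        current.append(stored.charAt(++i)); // keep the escaped character literally
      } else if (c == ',') {
        values.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    values.add(current.toString());
    return values;
  }
}
```

With such a scheme a value like `Doe, Jane` would survive a round trip through `join` and `split`, whereas a naive split on the separator would break it into two values.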
Metadata Storage and Indexing
Metadata storage and indexing will remain the same as they are now. As of now, we do not plan any storage or indexing improvements.
However, all our Metadata APIs will now deal with MetadataEntity rather than NamespacedEntityId (EntityId), which requires that we store MetadataEntity rather than NamespacedEntityId in our MetadataStore.
In our current approach we create an MDSKey from a NamespacedEntityId. The MDSKey is just a String representation of the NamespacedEntityId. Since a MetadataEntity is just another free-form representation of a NamespacedEntityId, we can change the MDSKey to accept a MetadataEntity and generate a similar string representation for the entity.
This approach will keep the metadata store backward compatible and will not require any upgrade or migration step.
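The backward-compatibility argument can be illustrated with a sketch in which the key built from a MetadataEntity equals the key previously built from the entity id. The `key:value` layout with a `:` separator is a made-up format for illustration only; the actual MDSKey encoding is not described in this document, and all names below are hypothetical.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: deriving a storage key from a MetadataEntity so that
// keys for existing entities stay unchanged.
final class MdsKeySketch {
  // Assumed legacy key layout for a dataset id (illustrative format only).
  static String legacyKeyForDataset(String namespace, String dataset) {
    return "namespace:" + namespace + ":dataset:" + dataset;
  }

  // Generic key: concatenate the entity's pairs in their given order.
  // The map must iterate in insertion order (for example a LinkedHashMap).
  static String keyFor(Map<String, String> entity) {
    StringJoiner joiner = new StringJoiner(":");
    for (Map.Entry<String, String> pair : entity.entrySet()) {
      joiner.add(pair.getKey()).add(pair.getValue());
    }
    return joiner.toString();
  }
}
```

Under these assumptions an entity-shaped MetadataEntity produces exactly the legacy key, so existing rows need no migration, and a schema field simply extends its dataset's key with an extra `field` pair.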
Open Questions
- How does metadata for a schema apply to external sinks (datasets) that CDAP does not know about, such as a Kudu table?
> Associated with external datasets.
- What are the different possibilities of search?
- Do we need to support mathematical operators such as >, <, <= etc.? In this case the data needs to be treated as numbers. Does the user need to specify the type of metadata being added?
- Do we need to support relational operators in search queries? For example: List all datasets
- Metadata now has a class/type (business, operational, technical). Do we need capabilities to filter metadata on this?
- How are resources like files, partitions etc., which are not CDAP entities and which CDAP does not know about, presented in the UI when discovered through metadata?
> To be designed
New REST APIs
- As documented in the design, new APIs will be added to support interacting with metadata on non-entities.
Deprecated REST API
- All existing Metadata APIs which are based on EntityId
CLI Impact or Changes
- The CLI will need to support metadata being associated with non-entities.
UI Impact or Changes
- UI should be able to show metadata for non-entities.
Security Impact
- Currently Metadata does not have authorization. We will be adding authorization to metadata.