Introduction 

The following design documents:

  1. A Plugin/Program metadata API to add metadata to elements inside CDAP that are not entities, for example files in a fileset or schema fields in a dataset.
  2. Authorization for Metadata

Design

MetadataEntity Representation Format

...

Code Block
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.annotation.Nullable;

public class Input implements PluginMetadataWriter {
  // Schema field which is input to the operation
  private final Schema.Field field;

  // Source information if the field belongs to a source/dataset, otherwise null
  @Nullable
  private final Source source;

  // Metadata information associated with the schema Field
  private final Map<String, String> metadataProperties = new HashMap<>();
  private final Set<String> tags = new HashSet<>();

  private Input(@Nullable Source source, Schema.Field field) {
    this.source = source;
    this.field = field;
  }

  // Create input from a Field. Since a Schema can be nested, a plain String cannot be
  // used to uniquely identify the Field in the Schema, as multiple Fields can have the
  // same name but different nesting. To uniquely identify a Field in the Schema we will
  // need an enhancement in the CDAP platform so that each Field can hold a transient
  // reference to its parent Field. From these references we can then create a unique field path.
  public static Input ofField(Schema.Field field) {
    return new Input(null, field);
  }

  // Create input from a Field which belongs to the given Source
  public static Input ofField(Source source, Schema.Field field) {
    return new Input(source, field);
  }


  @Override
  public void addProperties(Map<String, String> properties) {
    // adds to  metadataProperties
  }

  @Override
  public void addTags(String... tags) {
    // adds to tags
  }

  @Override
  public void addTags(Iterable<String> tags) {
    // adds to tags
  }

  @Override
  public void removeAllMetadata() {
    // remove all metadata (properties and tags)
  }

  @Override
  public void removeProperties() {
    // clears metadataProperties
  }

  @Override
  public void removeProperties(String... keys) {
    // clears the given keys from metadataProperties
  }

  @Override
  public void removeTags() {
    // removes all tags
  }

  @Override
  public void removeTags(String... tags) {
    // removes the specified tags from tags
  }
}

 

We can modify the LineageRecorder interface so that its implementation records both metadata and lineage.

Authorization for Metadata

Allowing metadata to be added to a CDAP MetadataEntity that is not an entity raises the question of authorization enforcement, i.e. who can add metadata to these resources. Since these resources are not entities, we currently cannot define policies for them.

Even though such MetadataEntities are not predefined, we can rely on the fact that these resources generally live under some CDAP EntityId. For example, a schema field is always associated with a dataset, and a file in a fileset is always associated with the fileset dataset itself. If no such relationship exists, we can rely on the fact that the resource exists under a namespace and perform authorization on the namespace's EntityId. For external resources that do not even exist under a namespace, we can enforce on the InstanceId if needed.

Since metadata always belongs to some MetadataEntity (an EntityId or a non-entity id) in CDAP, enforcement will be done on the EntityId (see above for how we determine the entity to enforce on when the MetadataEntity is not an EntityId).

Operation                      | Privilege Required
Get Metadata (Property/Tag)    | READ on the Entity with which the metadata is associated
Add Metadata (Property/Tag)    | WRITE on the Entity to which the metadata is being added
Remove Metadata (Property/Tag) | WRITE/ADMIN on the Entity from which the metadata is being removed
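
The resolution rule described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the CDAP implementation: the class MetadataAuthSketch, the ENTITY_KEYS set, and the representation of a MetadataEntity as an ordered map of key-value parts are all hypothetical. REMOVE is mapped to WRITE here, although the table lists WRITE/ADMIN.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class MetadataAuthSketch {
  // Hypothetical: key names that correspond to real CDAP entity types.
  static final Set<String> ENTITY_KEYS =
    new HashSet<>(Arrays.asList("namespace", "dataset", "application", "artifact"));

  enum Operation { GET, ADD, REMOVE }

  // Privilege needed on the enforcement target for each metadata operation.
  static String requiredPrivilege(Operation op) {
    return op == Operation.GET ? "READ" : "WRITE";
  }

  // Returns the leading parts of the MetadataEntity that form a known entity.
  // Parts after the first non-entity key (e.g. "field", "file") are dropped.
  // An empty result means there is no enclosing entity, not even a namespace,
  // so enforcement would fall back to the instance.
  static Map<String, String> enforcementTarget(Map<String, String> metadataEntity) {
    Map<String, String> target = new LinkedHashMap<>();
    for (Map.Entry<String, String> part : metadataEntity.entrySet()) {
      if (!ENTITY_KEYS.contains(part.getKey())) {
        break;
      }
      target.put(part.getKey(), part.getValue());
    }
    return target;
  }
}
```

For a schema field described as namespace=nsOne, dataset=dsOne, field=empName, the enforcement target is the dataset (namespace=nsOne, dataset=dsOne), and adding a tag to the field would require WRITE on that dataset.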

 

Metadata in Transaction

In CDAP 5.0 we introduce the capability of adding metadata from program/pipeline runs. This raises the question of what happens to metadata added in a pipeline run that failed. Metadata added by failed pipeline runs will be retained. Since, as of now, we expect metadata to be added to schema fields rather than to the data records themselves (which are written in a pipeline run that might fail, leading to no data being written), we can treat the two as unrelated. However, conditional metadata annotations will not give the expected results. For example, suppose a schema field is tagged "high" only if any entry written to the field has a value greater than 100. With a failed pipeline run we can end up with the schema field tagged "high" even though none of the stored data records has a value greater than 100.

Metadata for Versions

Our initial thought was to make metadata independent of the application/artifact version, to keep the behavior consistent with authorization policy, but there are use cases that this model does not serve well. For example, an enterprise has two versions of various applications deployed in its CDAP instance: one version is in production and the other is under active development. In such a scenario a user might want to tag all development versions with a tag such as "dev" and all production versions with "prod", so that they can later be discovered through search. Making metadata version independent will not work for this use case. So, we will support two ways of adding metadata to a versioned application/artifact:

  1. If no version is specified while adding metadata to an application/artifact, the metadata will be added to all existing versions.
  2. If a version is specified while adding metadata to an application/artifact, the metadata will be added only to the specified version.
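
The two modes above can be illustrated with a small in-memory sketch. The class VersionedTagsSketch and its methods are hypothetical and exist only for this example; a null version stands for "no version specified".

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class VersionedTagsSketch {
  // application name -> version -> tags
  private final Map<String, Map<String, Set<String>>> tags = new HashMap<>();

  // Registers a version of an application so that it can receive metadata.
  void deploy(String app, String version) {
    tags.computeIfAbsent(app, a -> new TreeMap<>())
        .computeIfAbsent(version, v -> new TreeSet<>());
  }

  // version == null: tag every existing version of the application.
  // version != null: tag only that version.
  void addTag(String app, String version, String tag) {
    Map<String, Set<String>> versions = tags.get(app);
    if (versions == null) {
      return;
    }
    if (version == null) {
      for (Set<String> versionTags : versions.values()) {
        versionTags.add(tag);
      }
      return;
    }
    Set<String> versionTags = versions.get(version);
    if (versionTags == null) {
      throw new IllegalArgumentException("Unknown version " + version + " of " + app);
    }
    versionTags.add(tag);
  }

  Set<String> tagsOf(String app, String version) {
    Map<String, Set<String>> versions = tags.get(app);
    if (versions == null || !versions.containsKey(version)) {
      return Collections.emptySet();
    }
    return versions.get(version);
  }
}
```

With versions 1.0 and 2.0 deployed, addTag("app", null, "etl") tags both versions, while addTag("app", "2.0", "dev") tags only 2.0. Note that in this sketch a version deployed later does not inherit tags that were added without a version, which matches the wording "all existing versions".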

Special Character Support for Metadata

Currently, our metadata system only allows the characters a-z, 0-9 and -. This severely restricts what users can store in our metadata system. For example, if someone wants to add a metadata value that contains commas, the current system will not allow it. We will extend our current metadata system to allow common special characters to be stored. This will require changing the characters that we use as separators while storing metadata.

Metadata Storage and Indexing

Metadata storage and indexing will remain essentially the same as they are now. As of now, we do not plan any storage or indexing improvements.

However, all our Metadata APIs will now deal with MetadataEntity rather than NamespacedEntityId (EntityId), which requires that we store MetadataEntity rather than NamespacedEntityId in our MetadataStore.

In our current approach we create an MDSKey from a NamespacedEntityId. The MDSKey is just a String representation of the NamespacedEntityId. Since a MetadataEntity is just another free-form representation of a NamespacedEntityId, we can change the MDSKey to accept a MetadataEntity and generate a similar string representation for the entity.

In the current implementation of MetadataDataset, the stored key is a toString representation of the EntityId, i.e. EntityType.entitydetails.key. For example, for a dataset it looks like:

Code Block
<length-encoding>DatasetInstance<length-encoding>namespaceName<length-encoding>datasetName<length-encoding>metadataKey

Note: We store the old Id representation rather than EntityIds to keep backward compatibility with previously serialized keys. During this release, when we upgrade the metadata store, we should definitely migrate all keys away from the old Ids to a serialization form that is independent of EntityIds.

For more information, please refer to the earlier design documentation of our metadata store and the implementation here:

Storage Design

MdsKey

With the changes proposed in this design document, we will introduce a class called MetadataEntity, which will be a list of key-value pairs. In a simple representation, the stored key will look like:

Code Block
<length-encoding>namespace<length-encoding>nsOne<length-encoding>dataset<length-encoding>dsOne<length-encoding>metadataKey

 

Similarly, for a file in a PartitionedFileSet (PFS) it will look something like this:

Code Block
<length-encoding>namespace<length-encoding>nsOne<length-encoding>dataset<length-encoding>dsOne<length-encoding>partition<length-encoding>partitionOne<length-encoding>file<length-encoding>fileOne<length-encoding>metadataKey
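
To make the layout above concrete, here is a minimal sketch of a MetadataEntity as an ordered list of key-value pairs that is serialized into a length-encoded key. The class name MetadataEntitySketch and the readable "length:value" encoding are assumptions for illustration only; the real store uses a binary length encoding.

```java
import java.util.ArrayList;
import java.util.List;

final class MetadataEntitySketch {
  // Ordered key-value parts, e.g. [namespace, nsOne], [dataset, dsOne]
  private final List<String[]> parts = new ArrayList<>();

  MetadataEntitySketch append(String key, String value) {
    parts.add(new String[] {key, value});
    return this;
  }

  // Serializes every key and value, followed by the metadata key,
  // each prefixed by its length.
  String toKey(String metadataKey) {
    StringBuilder sb = new StringBuilder();
    for (String[] part : parts) {
      encode(sb, part[0]);
      encode(sb, part[1]);
    }
    encode(sb, metadataKey);
    return sb.toString();
  }

  // Hypothetical readable stand-in for <length-encoding>: "<length>:<value>"
  private static void encode(StringBuilder sb, String value) {
    sb.append(value.length()).append(':').append(value);
  }
}
```

For example, new MetadataEntitySketch().append("namespace", "nsOne").append("dataset", "dsOne").toKey("myTag") yields 9:namespace5:nsOne7:dataset5:dsOne5:myTag. Because each component carries its own length, a component may contain any character without being confused with a separator.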


We cannot store this with our current storage key, because the key would be something like this:

file:nsOne.dsOne.partitionOne.fileOne

Since files are not an EntityId in CDAP, CDAP does not know the hierarchy of this custom entity type. Hence CDAP will not be able to construct the MetadataEntity back, since the individual keys are not persisted in the above format. To solve this, we will now store the MetadataEntity information with all of its key-value pairs. To maintain backward compatibility and to support search based on entity type, we will also keep prefixing the key with the target entity type, as before. So finally the key will look something like this:

Code Block
<length-encoding>file<length-encoding>namespace<length-encoding>nsOne<length-encoding>dataset<length-encoding>dsOne<length-encoding>partition<length-encoding>partitionOne<length-encoding>file<length-encoding>fileOne<length-encoding>metadataKey

Note that it is important to store the keys prefixed by the type, because this limits the scan size when we retrieve metadata for an entity/non-entity. For example, consider the following scenario.

Let's say myStreamOne is tagged with myTagOne and myTagTwo, and myStreamViewOne is tagged with myTagThree.

The current key format is EntityType:EntityDetails.MoreEntityDetails.MetadataKey, so the keys look like this (note that the : and . are just for readability; currently we store length encoding):

Code Block
stream:myNamespaceOne.myStreamOne.myTagOne
stream:myNamespaceOne.myStreamOne.myTagTwo
stream_view:myNamespaceOne.myStreamOne.myStreamViewOne.myTagThree


If we change it to store only the key-value parts of entities (without the entity-type prefix), the above will look like:

Code Block
namespace=myNamespaceOne.stream=myStreamOne.myTagOne
namespace=myNamespaceOne.stream=myStreamOne.myTagTwo
namespace=myNamespaceOne.stream=myStreamOne.stream_view=myStreamViewOne.myTagThree


Now when someone asks for all the metadata of myStreamOne, we do a prefix-based search to collect all the metadata keys. In the current implementation the search prefix is:

stream:myNamespaceOne.myStreamOne.

With our MetadataEntity change the search prefix will look like this:

namespace=myNamespaceOne.stream=myStreamOne.

The problem with the new prefix is that it will also match

namespace=myNamespaceOne.stream=myStreamOne.stream_view=myStreamViewOne.myTagThree

and give us the metadata for the stream view, which is a child of the stream. We can of course filter these out in a post-processing step, but this is very bad for namespace lookups, because the scan will return metadata for everything inside the namespace. Such large scan results can easily be avoided if we store the keys prefixed by entity type. If an entity type is not known, we can store it as a constant such as UNKNOWN_TYPE.
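
The effect of the entity-type prefix on scan size can be reproduced with a sorted set standing in for the metadata table. This is a sketch under assumptions: PrefixScanSketch is a hypothetical helper, and the keys use the readable : and . notation from this section rather than the real length encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

final class PrefixScanSketch {
  // Returns all keys that start with the prefix. Keys are sorted, so all
  // matches are contiguous and the scan can stop at the first mismatch.
  static List<String> scan(NavigableSet<String> keys, String prefix) {
    List<String> result = new ArrayList<>();
    for (String key : keys.tailSet(prefix, true)) {
      if (!key.startsWith(prefix)) {
        break;
      }
      result.add(key);
    }
    return result;
  }
}
```

If the three example keys are stored without a type prefix, scanning a TreeSet of them for namespace=myNamespaceOne.stream=myStreamOne. returns all three, including the stream view's myTagThree. If each key is stored with its type in front (stream: or stream_view:), scanning for stream:namespace=myNamespaceOne.stream=myStreamOne. returns only the two stream tags.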

 

Search Queries:

We will maintain support for all search queries as listed here for backward compatibility. No new search capabilities will be added.

 

Upgrade:

We will need an upgrade step that converts all keys from the old storage format to the new one. During this upgrade we will also get rid of the old Id compatibility serialization form that we currently use. Instead, we will use a serialization form that is independent of EntityId but maps directly to it, so that the serialized form can be converted into an EntityId as and when needed.
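
One possible shape for the per-key conversion in such an upgrade step is sketched below. It is an illustration only: KeyUpgradeSketch and its LAYOUT table are hypothetical, only the DatasetInstance layout shown earlier in this section is filled in, and keys are modeled as lists of components rather than length-encoded bytes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KeyUpgradeSketch {
  // Hypothetical table: old type name -> {new type prefix, key name for each old value}
  private static final Map<String, String[]> LAYOUT = new HashMap<>();
  static {
    LAYOUT.put("DatasetInstance", new String[] {"dataset", "namespace", "dataset"});
  }

  // Converts the entity part of an old key (type followed by bare values) into the
  // new form (type prefix followed by key-value pairs). The metadata key that
  // follows the entity part is unchanged and therefore not handled here.
  static List<String> upgrade(String oldType, List<String> oldValues) {
    String[] layout = LAYOUT.get(oldType);
    if (layout == null || layout.length - 1 != oldValues.size()) {
      throw new IllegalArgumentException("No matching layout for " + oldType);
    }
    List<String> upgraded = new ArrayList<>();
    upgraded.add(layout[0]);
    for (int i = 0; i < oldValues.size(); i++) {
      upgraded.add(layout[i + 1]);
      upgraded.add(oldValues.get(i));
    }
    return upgraded;
  }
}
```

Here upgrade("DatasetInstance", [nsOne, dsOne]) produces [dataset, namespace, nsOne, dataset, dsOne], which corresponds to the type-prefixed key-value layout described in this section.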

Open Questions

  1. How is metadata for a schema applied to external sinks (datasets) that CDAP does not know about, such as a Kudu table?
    > Associated with external datasets.
  2. What are the different possibilities of search?
    1. Do we need to support mathematical (comparison) operators such as >, <, <= etc.? In this case the data needs to be treated as numbers. Does the user need to specify the type of the metadata being added?
    2. Do we need to support relational operators in search queries? For example: List all datasets 
    3. Metadata now has a class/type (business, operational, technical). Do we need capabilities to filter metadata on this?
  3. How are resources such as files and partitions, which are not CDAP entities and which CDAP does not know about, presented in the UI when discovered through metadata?
    > To be designed

...