
Metadata Storage and Indexing

In the current implementation of MetadataDataset, the stored key is a toString representation of the EntityId, i.e.
EntityType.entitydetails.key. For example, for a dataset it looks like DatasetInstance:namespace.datasetName.metadataKey

Code Block
<length-encoding>DatasetInstance<length-encoding>namespaceName<length-encoding>datasetName<length-encoding>metadataKey
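
As a rough illustration of the layout above, the following sketch builds such a key from its parts. It is not the actual MetadataDataset code: the class name LengthEncodedKey and the choice of a 4-byte big-endian length prefix followed by UTF-8 bytes are assumptions made only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the real MetadataDataset implementation.
final class LengthEncodedKey {
  private LengthEncodedKey() { }

  // Assumption: every part is written as a 4-byte big-endian length followed by its UTF-8 bytes.
  static byte[] encode(String... parts) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (String part : parts) {
      byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
      byte[] length = ByteBuffer.allocate(Integer.BYTES).putInt(bytes.length).array();
      out.write(length, 0, length.length);
      out.write(bytes, 0, bytes.length);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    // Same part order as the key shown above: type, namespace, dataset, metadata key.
    byte[] key = encode("DatasetInstance", "namespaceName", "datasetName", "metadataKey");
    System.out.println("key length in bytes: " + key.length);
  }
}
```

Because every part carries its own length, a part may contain characters such as : or . without making the key ambiguous.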

Note: We store the old Id representation of the Ids and not EntityIds to keep backward compatibility with previously serialized keys. During this release, when we upgrade the metadata store, we should definitely migrate all the keys away from the old Ids to a serialization form that is independent of EntityIds, so that our serialization does not break when EntityIds are renamed or changed.

We do this because it allows us to serve search queries like dataset:* as well as getMetadata() queries for an entity such as a Dataset.

For more information, please refer to the earlier design documentation of our metadata store and the implementation here:

...

With the changes proposed in this design document we will introduce a class called MetadataEntity, which will be a List of key-value pairs. In a simple representation it will look like: namespace=nsOne<separator>dataset=dsOne

Code Block
<length-encoding>namespace<length-encoding>nsOne<length-encoding>dataset<length-encoding>dsOne<length-encoding>metadataKey

 

Also, for a file in PFS it will look something like this:

namespace=nsOne<separator>dataset=dsOne<separator>partition=partitionOne<separator>file=fileOne

...

Code Block
<length-encoding>namespace<length-encoding>nsOne<length-encoding>dataset<length-encoding>dsOne<length-encoding>partition<length-encoding>partitionOne<length-encoding>file<length-encoding>fileOne<length-encoding>metadataKey
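
To make the idea of an ordered list of key-value pairs concrete, here is a minimal sketch. The class name MetadataEntitySketch, its methods, and the rule that the last key names the entity type are all assumptions made for illustration; the proposed MetadataEntity class may look different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of an entity described by ordered key-value pairs.
final class MetadataEntitySketch {
  // One pair, for example namespace=nsOne.
  static final class KeyValue {
    final String key;
    final String value;

    KeyValue(String key, String value) {
      this.key = key;
      this.value = value;
    }
  }

  private final List<KeyValue> parts;

  private MetadataEntitySketch(List<KeyValue> parts) {
    this.parts = Collections.unmodifiableList(new ArrayList<>(parts));
  }

  // Builds an entity from alternating keys and values; order is preserved.
  static MetadataEntitySketch of(String... keysAndValues) {
    if (keysAndValues.length == 0 || keysAndValues.length % 2 != 0) {
      throw new IllegalArgumentException("expected a non-empty list of alternating keys and values");
    }
    List<KeyValue> parts = new ArrayList<>();
    for (int i = 0; i < keysAndValues.length; i += 2) {
      parts.add(new KeyValue(keysAndValues[i], keysAndValues[i + 1]));
    }
    return new MetadataEntitySketch(parts);
  }

  // Assumption: the key of the last pair is treated as the entity type.
  String getType() {
    return parts.get(parts.size() - 1).key;
  }

  // Renders the readable form used in this document, e.g. namespace=nsOne<separator>dataset=dsOne.
  String toReadableString(String separator) {
    StringBuilder sb = new StringBuilder();
    for (KeyValue kv : parts) {
      if (sb.length() > 0) {
        sb.append(separator);
      }
      sb.append(kv.key).append('=').append(kv.value);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    MetadataEntitySketch file = of("namespace", "nsOne", "dataset", "dsOne",
                                   "partition", "partitionOne", "file", "fileOne");
    System.out.println(file.getType());
    System.out.println(file.toReadableString("<separator>"));
  }
}
```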


We cannot store this with our current storage key, as the key would be something like this:

...

Since files are not an EntityId in CDAP, CDAP does not know the hierarchy of this custom entity type. Hence CDAP will not be able to reconstruct the MetadataEntity, since the individual keys are not persisted in the above format. To solve this issue we will now store the MetadataEntity information with all its key-value pairs. To maintain backward compatibility and to support search based on the entity type, we will also keep prefixing the key with the target entity type as before. So finally the key will look something like this:

Code Block
<length-encoding>file<length-encoding>namespace<length-encoding>nsOne<length-encoding>dataset<length-encoding>dsOne<length-encoding>partition<length-encoding>partitionOne<length-encoding>file<length-encoding>fileOne<length-encoding>metadataKey
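
The point of persisting every key-value pair is that the entity can be rebuilt from the stored key alone. The sketch below encodes a type-prefixed key and decodes it again. The class TypedKeyCodec and the 4-byte length prefix are assumptions for illustration only, not the format CDAP actually uses.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical codec for a key laid out as: type, key-value pairs, metadata key.
final class TypedKeyCodec {
  private TypedKeyCodec() { }

  // Assumption: each part is a 4-byte big-endian length followed by UTF-8 bytes.
  static byte[] encode(String type, List<String> keysAndValues, String metadataKey) {
    List<String> parts = new ArrayList<>();
    parts.add(type);
    parts.addAll(keysAndValues);
    parts.add(metadataKey);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (String part : parts) {
      byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
      out.write(ByteBuffer.allocate(Integer.BYTES).putInt(bytes.length).array(), 0, Integer.BYTES);
      out.write(bytes, 0, bytes.length);
    }
    return out.toByteArray();
  }

  // Reads the parts back in order: index 0 is the type, the last index is the metadata key,
  // and everything in between alternates between entity keys and values.
  static List<String> decode(byte[] key) {
    List<String> parts = new ArrayList<>();
    ByteBuffer in = ByteBuffer.wrap(key);
    while (in.hasRemaining()) {
      byte[] bytes = new byte[in.getInt()];
      in.get(bytes);
      parts.add(new String(bytes, StandardCharsets.UTF_8));
    }
    return parts;
  }

  public static void main(String[] args) {
    byte[] key = encode("file",
        Arrays.asList("namespace", "nsOne", "dataset", "dsOne",
                      "partition", "partitionOne", "file", "fileOne"),
        "metadataKey");
    List<String> parts = decode(key);
    System.out.println("type        = " + parts.get(0));
    System.out.println("entity      = " + parts.subList(1, parts.size() - 1));
    System.out.println("metadataKey = " + parts.get(parts.size() - 1));
  }
}
```

With the old format the decoder would only recover the values (nsOne, dsOne, ...) and would need to know the hierarchy of the type to name them, which is exactly what is missing for custom types such as files.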

It is important to store the keys prefixed by the type because this limits the scan size when we retrieve metadata for an entity or a non-entity. For example, consider the following scenario:

Let's say myStreamOne is tagged with myTagOne and myTagTwo, and myStreamViewOne is tagged with myTagThree.

The current key format is EntityType:EntityDetails.MoreEntityDetails.MetadataKey
So the keys look like this (note that the : and . are only for readability; the stored keys currently use length encoding):

Code Block
stream:myNamespaceOne.myStreamOne.myTagOne
stream:myNamespaceOne.myStreamOne.myTagTwo
stream_view:myNamespaceOne.myStreamOne.myStreamViewOne.myTagThree


If we change it to store the key-value parts of entities (without the entity-type prefix), the above will look like:

Code Block
namespace=myNamespaceOne.stream=myStreamOne.myTagOne
namespace=myNamespaceOne.stream=myStreamOne.myTagTwo
namespace=myNamespaceOne.stream=myStreamOne.stream_view=myStreamViewOne.myTagThree


Now when someone asks for all the metadata for myStreamOne, we do a prefix-based search to collect all the metadata keys. In the current implementation the search prefix is:

stream:myNamespaceOne.myStreamOne.

With our MetadataEntity change the search prefix will look like this:

namespace=myNamespaceOne.stream=myStreamOne.

The problem with the new key above is that it will also match
namespace=myNamespaceOne.stream=myStreamOne.stream_view=myStreamViewOne.myTagThree

and give us the metadata for the stream view, which is a child of the stream. We can of course filter these out in a post-processing step, but this is very bad for namespace searches because the scan will return metadata for everything inside the namespace. Such large scan results can easily be avoided if we store the keys prefixed by entity type. If an entity type is not known, we can store it as some constant like UNKNOWN_TYPE.
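
The effect of the type prefix on a prefix scan can be reproduced with a small in-memory sketch. A sorted set stands in for the metadata table, and the keys use the readable : and . separators from this section rather than length encoding. The class PrefixScanDemo and the readable typed-key form (type, then key-value parts) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical demo: a sorted set stands in for the sorted rows of the metadata table.
final class PrefixScanDemo {
  private PrefixScanDemo() { }

  // Walks the sorted keys starting at the prefix and stops at the first key that no longer matches.
  static List<String> scan(NavigableSet<String> sortedKeys, String prefix) {
    List<String> hits = new ArrayList<>();
    for (String key : sortedKeys.tailSet(prefix, true)) {
      if (!key.startsWith(prefix)) {
        break;
      }
      hits.add(key);
    }
    return hits;
  }

  // Keys without an entity-type prefix, as in the example above.
  static NavigableSet<String> untypedKeys() {
    return new TreeSet<>(Arrays.asList(
        "namespace=myNamespaceOne.stream=myStreamOne.myTagOne",
        "namespace=myNamespaceOne.stream=myStreamOne.myTagTwo",
        "namespace=myNamespaceOne.stream=myStreamOne.stream_view=myStreamViewOne.myTagThree"));
  }

  // The same keys with the entity type stored in front.
  static NavigableSet<String> typedKeys() {
    return new TreeSet<>(Arrays.asList(
        "stream:namespace=myNamespaceOne.stream=myStreamOne.myTagOne",
        "stream:namespace=myNamespaceOne.stream=myStreamOne.myTagTwo",
        "stream_view:namespace=myNamespaceOne.stream=myStreamOne.stream_view=myStreamViewOne.myTagThree"));
  }

  public static void main(String[] args) {
    // Without the type prefix the scan also returns the stream view's key.
    System.out.println(scan(untypedKeys(), "namespace=myNamespaceOne.stream=myStreamOne."));
    // With the type prefix only the stream's own keys are returned.
    System.out.println(scan(typedKeys(), "stream:namespace=myNamespaceOne.stream=myStreamOne."));
  }
}
```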

 

Search Queries:

We will maintain support for all search queries as listed here for backward compatibility. No new search capabilities will be added.

 

Upgrade:

We will need an upgrade step that converts all the keys from the old storage format to the new one. During this upgrade we will also drop the old Id compatibility serialization form that we use today. The new serialization form will be independent of EntityId but will map directly to it, so that the serialized form can be converted into an EntityId as and when needed.

Open Questions

  1. How is metadata for a schema handled when it is applied to external sinks (datasets) that CDAP does not know about, such as a Kudu table?
    > Associated with external datasets.
  2. What are the different possibilities of search?
    1. Do we need to support mathematical operators such as >, <, <=, etc.? In this case the data needs to be treated as numbers. Does the user need to specify the type of the metadata being added?
    2. Do we need to support relational operators in search queries? For example: List all datasets 
    3. Metadata now has a class/type (business, operational, technical). Do we need capabilities to filter metadata on this?
  3. How are resources such as files and partitions, which are not CDAP entities and which CDAP does not know about, presented in the UI when discovered through metadata?
    > To be designed

...