Metadata Management - External Integrations

Goals: 

  • CDAP contains multiple entities - for example, Namespaces, Applications, Programs and Datasets (there could also be fine-grained entities such as Partitions in a PFS Dataset or the fields in a Table Dataset).
    We have system and business metadata for each of these entities. We should be able to push this metadata to external metadata management systems such as Cloudera Navigator and Apache Atlas, henceforth referred to as MDM.

Checklist 

  • User stories documented (Gokul)
  • User stories reviewed (Nitin)
  • Design documented (Gokul)
  • Design reviewed (Andreas)
  • Feature merged (Gokul)
  • Examples and guides (Gokul)
  • Integration tests (Gokul) 
  • Documentation for feature (Gokul)
  • Blog post (Gokul)


User Stories:
 

  • CDAP business and system metadata entities should automatically show up in MDM
  • CDAP user should be able to search for CDAP business and system metadata using MDM
  • Any updates/deletes to system or business metadata in CDAP should automatically reflect in MDM
  • Users should be able to search on dataset or streams schema fields (fine-grained entities) in MDM
  • Existing metadata (data that existed before MDM integration was enabled) should also be made available in MDM (depends on whether messages are available in Kafka) (Low priority)
  • Updates/deletion of custom metadata in MDM should be reflected in CDAP (Low priority)
  • Pushing business metadata of CDAP entities to underlying entities - For example, if a CDAP Table dataset is marked as ‘sensitive’, this tag should be pushed to the corresponding HBase Table created by CDAP (Low priority) 

Design:

Technical Constraints for Cloudera Navigator

Navigator currently pulls in data periodically from different Hadoop components - HDFS, Hive etc. It uses Solr for indexing. Navigator also provides a simple Java client to set and query metadata.
Though the client is limited in its features, it can potentially be used to push custom metadata for entities to Navigator. But there are a few known and unknown issues:
 

  1. Pushing data seems straightforward using the Java client, but subscribing to metadata changes in Navigator does not.
    Ramification: Users can’t edit business metadata for CDAP entities in Navigator and expect it to be reflected in the CDAP Metadata system.
    Tradeoff: Navigator can only read CDAP metadata; it can’t modify or write it. Any modifications made in Navigator will not be reflected in the CDAP Metadata Store.

  2. Creating brand new SourceTypes or EntityTypes doesn’t seem possible using the Java Client SDK.
    Ramification: This is a big blocker if we want a CDAP SourceType at the same level as the SourceTypes for HDFS, Hive, Oozie etc. (HBase is missing). There is a catch-all SourceType called SDK, but EntityTypes don’t seem flexible enough to allow new ones either.
    Workaround: Check with the Cloudera Navigator team to see what is feasible given our ideal data model, for example using SourceType.SDK.

  3. SystemMetadata => Technical Metadata - setting this through the Java client does not seem to be possible.
    Ramification: The Java client can only alter custom metadata. This might be confusing for users, as there is a clear one-to-one mapping: CDAP system metadata corresponds to Navigator technical metadata, and CDAP business metadata corresponds to Navigator custom metadata.
    Tradeoff: Don’t push system metadata and only publish business metadata, or publish system metadata fields as ‘custom metadata’ in Navigator.

  4. UI rendering for different Source/Entity combinations: If we do manage to set up a custom CDAP SourceType, it is not clear how the UI rendering will work for it. Rendering is fairly simple for most of the SourceTypes (with special tags for, say, a Hive Table schema), but we have to confirm whether other types are supported by default.
    Ramification: We lose out on a potentially enriched user experience.
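
One of the tradeoffs in item 3 above, publishing system metadata fields as ‘custom metadata’, could be sketched as follows. This is only an illustration under assumptions: the class name, the method name and the "cdap.system." key prefix are hypothetical and are not part of CDAP or the Navigator SDK. The prefix is there so that system-derived keys cannot collide with business keys set by users.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: folds CDAP system metadata into custom-metadata properties. */
final class SystemMetadataFlattener {

  // Assumed prefix; keeps system-derived keys apart from user-set business keys.
  static final String SYSTEM_PREFIX = "cdap.system.";

  /**
   * Returns a new map containing all business properties unchanged, plus every
   * system property under a prefixed key. The input maps are not modified.
   */
  static Map<String, String> toCustomProperties(Map<String, String> business,
                                                Map<String, String> system) {
    Map<String, String> merged = new LinkedHashMap<>(business);
    for (Map.Entry<String, String> entry : system.entrySet()) {
      merged.put(SYSTEM_PREFIX + entry.getKey(), entry.getValue());
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> business = new LinkedHashMap<>();
    business.put("owner", "finance");
    Map<String, String> system = new LinkedHashMap<>();
    system.put("type", "Table");
    // The merged map is what would be handed to the external system as custom properties.
    System.out.println(toCustomProperties(business, system));
  }
}
```

With this approach users searching in Navigator would see system fields next to business fields, distinguishable only by the key prefix.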

Architecture Design

Option i: A System Service that pushes our metadata changes to an external metadata management system. This is an optional system service that can be enabled in cdap-site.xml, with a pluggable choice of external system. For this work the external system will be Navigator, but in future we can support Apache Atlas. The system service will subscribe to the Kafka topic to which the CDAP MetadataAdmin publishes metadata changes. These messages are then pushed to the external system - in the case of Navigator we could use the Navigator SDK Java client. We will also have to use a system dataset to store the Kafka offset. A potential downside of this approach is that the service consumes another valuable container resource in the cluster.
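
The core loop of Option i could be sketched roughly as below. All names here (MetadataForwarder, MetadataChange, ExternalMetadataPublisher, OffsetStore) are hypothetical; the Kafka consumer, the real message format and the Navigator client calls are deliberately left out, and the offset store is an in-memory stand-in for the system dataset. The sketch only shows the ordering of "publish, then persist offset".

```java
import java.util.List;

/** Sketch of Option i: forward metadata changes to a pluggable external system. */
final class MetadataForwarder {

  /** Hypothetical change message; the actual Kafka payload is not specified here. */
  static final class MetadataChange {
    final long offset;
    final String entityId;
    final String payload;

    MetadataChange(long offset, String entityId, String payload) {
      this.offset = offset;
      this.entityId = entityId;
      this.payload = payload;
    }
  }

  /** Hypothetical plug-in point, one implementation per external system. */
  interface ExternalMetadataPublisher {
    void publish(MetadataChange change) throws Exception;
  }

  /** Stand-in for the system dataset that stores the next Kafka offset to read. */
  interface OffsetStore {
    long load();
    void save(long nextOffset);
  }

  /** In-memory offset store, for illustration and tests only. */
  static final class InMemoryOffsetStore implements OffsetStore {
    private long next = 0L;
    @Override public long load() { return next; }
    @Override public void save(long nextOffset) { next = nextOffset; }
  }

  private final ExternalMetadataPublisher publisher;
  private final OffsetStore offsets;

  MetadataForwarder(ExternalMetadataPublisher publisher, OffsetStore offsets) {
    this.publisher = publisher;
    this.offsets = offsets;
  }

  /**
   * Pushes a batch (assumed to be in offset order) and returns how many
   * messages were published. Stops at the first failure so that the failed
   * message is retried on the next call.
   */
  int processBatch(List<MetadataChange> batch) {
    long next = offsets.load();
    int published = 0;
    for (MetadataChange change : batch) {
      if (change.offset < next) {
        continue; // already pushed in an earlier run
      }
      try {
        publisher.publish(change);
      } catch (Exception e) {
        break; // offset is not advanced, so this message is retried later
      }
      next = change.offset + 1;
      offsets.save(next); // persist only after a successful push
      published++;
    }
    return published;
  }
}
```

Because the offset is saved after the push, a crash between the two steps would cause the same message to be pushed again on restart, so pushes to the external system would need to be idempotent under this design.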
 

Option ii: A custom CDAP Application with a Worker or a Flow that subscribes to the Kafka messages and uses the Navigator Java client to write to Navigator. The advantage of this approach is that users can decide when they want to run the Program. The downside is that if there is a failure in the program logic, the issue will only appear in application logs and may not be monitored with the same vigor as system services.

 

Additional Details:

 

Cloudera Navigator SDK Java Client
Compatibility Matrix for Java Client (Java client needs Navigator version >= 2.4 which in turn requires CDH version >= 5.5)


Cloudera Navigator Primer 

Cloudera Navigator is a one-stop shop for users of a CDH cluster to query and modify metadata of various Hadoop entities. Navigator is also useful for audits and lineage, but in this integration we are focusing only on the metadata part of it.
Here is a screenshot of how the metadata is displayed for an HDFS directory.

[Image: Screen Shot 2015-12-18 at 1.09.31 PM.jpg]


As you can see, Cloudera Navigator knows about each Source and Entity Type. In this case the SourceType is HDFS and the EntityType is Directory. Technical Metadata (analogous to System Metadata in CDAP parlance) varies for different Source/Entity combinations.
Custom Metadata (analogous to Business Metadata in CDAP) allows users to set tags and key-value pairs as properties.

Custom Metadata is indexed for search, and so is some of the Technical Metadata (it is not yet clear which fields are indexed or how). So in the above scenario, a user can query for the ‘sensitive’ tag and get all entities that have that custom metadata tag set.


[Image: Screen Shot 2015-12-18 at 1.17.41 PM.jpg]