  1. Field level metadata tagging: 

    1. Capabilities to tag metadata at the field level for dataset schemas.
    2. The capability needs to be exposed to data pipelines and plugins, as well as enterable through the UI.
    3. It should be possible to add field-level tags both automatically and manually.

      Example 1 (Metadata based processing): Fields in a dataset are marked as PII, and a generic anonymization process takes a dataset as input and anonymizes all the fields tagged as PII. Consider a CDAP data pipeline which reads user profile information containing PII data, for example the social security number associated with a user. Based on the social security number field, the pipeline fetches additional information about the user, such as the user's current address and phone number. The user information is then stored in the data lake. Once the data is in the data lake, a generic anonymization process can look at the PII fields of a given dataset and anonymize them. Since the user's phone number and address (which were derived from the social security number) are also considered sensitive personal information, it is useful to tag them as PII too, so that those fields also get anonymized.

      Example 2 (Metadata based discovery): Give me all the datasets in the data lake that have a field marked as PII. For instance, a data governance officer might want the list of PII fields in the user profile dataset present in the data lake. In this case they would want to get the social security number, phone number, and address fields as well.
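The anonymization flow above can be sketched as follows; the schema, tag names, and hashing scheme are illustrative assumptions, not CDAP APIs:

```python
# Sketch: field-level tags drive a generic anonymization pass.
# The schema, tag names, and hashing scheme are illustrative, not CDAP APIs.
import hashlib

# Field-level metadata for a hypothetical user profile schema.
field_tags = {
    "ssn": {"PII"},
    "phone": {"PII"},
    "address": {"PII"},
    "favorite_color": set(),
}

def anonymize(value):
    """Replace a sensitive value with a truncated one-way hash."""
    return hashlib.sha256(str(value).encode()).hexdigest()[:12]

def anonymize_record(record, tags):
    """Anonymize every field tagged as PII; leave other fields untouched."""
    return {
        name: anonymize(value) if "PII" in tags.get(name, set()) else value
        for name, value in record.items()
    }

record = {"ssn": "123-45-6789", "phone": "555-0100",
          "address": "1 Main St", "favorite_color": "blue"}
clean = anonymize_record(record, field_tags)
```

Because the process keys off tags rather than field names, the same pass also anonymizes the derived phone and address fields once they are tagged as PII.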

  2. Finer granularity (File/Partition/Table in a database) metadata:
    1. Capabilities to annotate metadata at the object level. (We need to define what the different types of objects are, and also whether it is possible to specify custom objects.)
      For example: 
      1. How was the directory on HDFS created?
      2. Which files were responsible for creating this partition in a dataset?
    2. This helps answer compliance-related questions, such as how a file got created. It also provides traceability, so that if a file is bad we know its origin.
    3. Data governance officers primarily understand files; they want to know which process generated a dataset (for example, CDAP), who is using it, whether there were any errors, and its checksums.
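A minimal sketch of what such object-level traceability metadata could look like; dataset names, paths, and fields are hypothetical:

```python
# Sketch: object-level metadata that links a dataset partition back to the
# HDFS files that produced it. Dataset names, paths, and fields are hypothetical.
partition_metadata = {
    ("sales", "2017-10-01"): {
        "created_by": "CDAP pipeline 'daily_ingest'",  # which process created it
        "source_files": [                              # which files fed the partition
            "/data/raw/sales/part-00000",
            "/data/raw/sales/part-00001",
        ],
        "checksum": "9f2e",  # illustrative placeholder
    }
}

def origin_of(dataset, partition):
    """Answer the traceability question: where did this partition come from?"""
    return partition_metadata[(dataset, partition)]["source_files"]
```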
  3. Store metadata along with the record: Capabilities are required to store metadata from the source within the record itself.
    For example: In a data pipeline, if we are reading inventory data from a source tagged as 'HR', denoting that the data belongs to the HR department, then while processing the pipeline it would be useful to add the tag 'HR' to each record, so that once the data lands in the data lake it can be identified for analysis purposes. (Change it to publisher id.)
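One way the source tag could be stamped onto each record might look like this; the `_source_tag` field name and helper are assumptions for illustration:

```python
# Sketch: stamp every record with its source's tag so the data stays
# identifiable after landing in the data lake. The '_source_tag' field
# name and the helper function are assumptions for illustration.
def stamp_records(records, source_tag):
    """Return copies of the records with the source tag embedded in each."""
    return [dict(r, _source_tag=source_tag) for r in records]

inventory = [{"item": "laptop", "count": 4}, {"item": "badge", "count": 10}]
stamped = stamp_records(inventory, "HR")
```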
  4. Field level lineage:
  5. Metadata provenance:
    1. This helps answer questions such as who changed what metadata. (We need both pieces: who made the change and what the change was.)
    2. Provenance information should be available through the REST API as well as through the UI.
    3. Both technical and business metadata should be tracked.
      Example (Metadata change based processing): Only trigger processing if the metadata associated with the source has changed since the source was last processed.
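A toy sketch of a provenance store that records who changed what, and whose change log can drive the metadata-change trigger described above; the `MetadataStore` class is hypothetical, not a CDAP API:

```python
# Sketch: every metadata mutation records who made it and what changed,
# and the change log can drive metadata-change-based triggers.
# The MetadataStore class and its fields are illustrative, not a CDAP API.
class MetadataStore:
    def __init__(self):
        self.tags = {}       # entity -> set of tags
        self.audit_log = []  # (seq, user, entity, change) tuples
        self._seq = 0

    def add_tag(self, user, entity, tag):
        """Record both who made the change and what the change was."""
        self._seq += 1
        self.tags.setdefault(entity, set()).add(tag)
        self.audit_log.append((self._seq, user, entity, f"added tag '{tag}'"))

    def changed_since(self, entity, seq):
        """True if any change touched the entity after sequence number seq,
        i.e. only trigger reprocessing when this returns True."""
        return any(s > seq and e == entity for s, _, e, _ in self.audit_log)

store = MetadataStore()
store.add_tag("alice", "user_profiles.ssn", "PII")
```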
  6. Metadata propagation: By default, propagate metadata at the field level, but allow the pipeline developer to override this. For example: if a field is tagged as PII, any field generated from it needs to be tagged as PII too.
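The default field-level propagation with a developer override could be sketched like this; function and parameter names are assumptions:

```python
# Sketch: by default, an output field inherits the union of tags from the
# input fields it was derived from; the pipeline developer can override
# per field. Function and parameter names are assumptions.
def propagate_tags(field_tags, derivations, overrides=None):
    """derivations maps an output field to the input fields it came from."""
    overrides = overrides or {}
    propagated = {}
    for out_field, in_fields in derivations.items():
        if out_field in overrides:
            # Developer override wins over the default propagation.
            propagated[out_field] = set(overrides[out_field])
        else:
            propagated[out_field] = set().union(
                *(field_tags.get(f, set()) for f in in_fields)
            )
    return propagated

tags = {"ssn": {"PII"}}
# 'phone' is derived from 'ssn', so it inherits the PII tag by default.
derived = propagate_tags(tags, {"phone": ["ssn"], "region": ["zip"]})
```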
  7. Integration with Enterprise metadata systems:
    1. Consume metadata events from external systems such as Atlas, Collibra, etc. into CDAP.
    2. Push metadata events from CDAP to external systems such as Atlas and Collibra.
    3. What metadata is pushed from CDAP to the external systems needs to be configurable.
    4. The integration should also maintain referential integrity with the external systems. For example, if I browse for a field in the UI, the corresponding tags should also be fetched from the external systems. (Not very clear about this requirement.)
    5. This helps achieve a completely automated metadata framework.
  8. Operational metadata:
    1. The system should be able to generate operational metadata. For example: a file was generated, and during its processing 1000 warnings were raised. This information needs to be captured.
  9. Accessing the captured metadata:
    1. The captured metadata should be accessible through the API, plugins, and the UI.
    2. This will help apply rules based on the retrieved metadata. For example, if a field is marked as PII, then perform an obfuscation process.
    3. In flight data
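A small sketch of rule-based processing driven by retrieved metadata, e.g. obfuscating fields marked as PII; the rule names and the obfuscation action are illustrative:

```python
# Sketch: rules keyed on retrieved field metadata decide per-field processing,
# e.g. obfuscate anything tagged PII. Rule names and actions are illustrative.
def obfuscate(value):
    """Mask a sensitive value character by character."""
    return "*" * len(str(value))

rules = {"PII": obfuscate}

def apply_rules(record, field_tags, rules):
    """Apply every matching tag rule to each field of the record."""
    out = {}
    for name, value in record.items():
        for tag in field_tags.get(name, set()):
            if tag in rules:
                value = rules[tag](value)
        out[name] = value
    return out

masked = apply_rules({"ssn": "123-45-6789", "color": "blue"},
                     {"ssn": {"PII"}}, rules)
```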

...