Metadata 5.X+

Introduction

 

Metadata is data about data; in other words, data that describes other data. There are many kinds of metadata, including:
  • Operational meta data describes the way that data was processed or created:
    • metrics: statistics about the data, and possibly about the processing that produced the data. 
    • lineage: who produced this data, when, and how
    • audit: who or what accessed this data in what way (read or modified)
  • Technical metadata is associated with data and describes its technical properties:
    • checksums, number of records
    • format, schema, etc.
  • Business metadata is associated with data to tag, categorize, or inventory it, or to comply with some other business process. It is typically not intrinsic to the data, that is, it cannot be derived from the data itself. 
    • Tags such as “confidential”, “pii”, “financial”
    • Properties such as “businessUnit:xyz” or “market:EMEA”
Applications for metadata are too numerous to list here. In the context of CDAP, metadata is used for two main purposes:
  1. Data Governance: 
    1. Traceability: For a piece of data, where did it originate, how was it processed/transformed, where was it sent to, etc. 
    2. Compliance: Many enterprises are under strict regulations that require the ability to trace back all data (and metadata) to its origin and over its lifetime. 
  2. Discovery: Data scientists or business analysts use meta data to find data that they are interested in.

User Stories


[Trace back]

  1. A credit card statement has a wrong charge and the customer complained about it. The bank needs to find out where the incorrect data originates from. 
    • Was the original data already incorrect? Then it needs to be identified for further action
    • Was the data damaged during processing? If so, how was it processed, what were the pipelines/plugins that processed it, with what configuration? 
  2. A downstream process fails because its input data contains a field that does not comply with the schema. The operations team needs to determine why: 
    • What pipeline produced this data from what input data?  
    • What operations were applied to the input to produce this field?
  3. A user notices that time stamps in a data set are in the wrong time zone, but only for some data. The operations team needs to find out:
    • Where did the incorrect data come from? Is it one of the data providers that sends incorrect time stamps? Or is the problem in the pipeline that ingested the data?
  4. An Admin finds out that a certain dataset is very popular and used by many downstream consumers. He wishes to trace it back to the source to apply stronger policies to secure such data sources.

[Trace forward]

  1. A data provider calls a bank's data lake operator and notifies him that the data received over a time period was wrong. The bank now needs to find out what other data was derived from this data, and reprocess it with the correct input data.
  2. Knowledge about how the output of a pipeline is used by downstream consumers can help the pipeline developer optimize the pipeline. For example, he can apply a filter or normalization in the pipeline if he finds out that all consumers apply it.
  3. An Admin found out that a source inappropriately contained sensitive information. Tracing forward helps him determine derived datasets that need to be (re-)classified as sensitive. 

[Meta data provenance]

  1. A data scientist notices that a data set is not tagged as “PII” even though it contains phone numbers. He calls the data lake operations team. The team that produces the data assures them that it tagged this data as “PII”. The operations team wants to find out why the tag is missing (was it modified or removed after the fact, or was it missing at creation time?) and consults the audit logs/change history of the data set’s metadata. 
  2. A data scientist notices that a data set carries a particular tag. The data scientist wants to know who added this tag and when it was added.

[Discovery]

  1. For a data experiment, a scientist wants to process credit card transactions that have been normalized to UTC time stamps. How can he find a dataset that has this data? And if that data does not exist, how can he find a data set with credit card transactions, and normalize the time stamps himself? He will search the meta data for:
    1. Datasets that are tagged / described as credit card transactions
    2. Datasets that have a time stamp field tagged “utc” or “normalized”

[Fine-Grained Metadata]

  1. In case of a major security breach, the Admin of the data lake can validate the authenticity of each file in a dataset based on its creation time.
  2. Data quality can vary within a dataset, based on the origin of each file. It is useful to assign data quality metadata to each file. 

[Metrics as Metadata/Data Quality]

  1. The data scientist further wants to understand the quality of the data. For this, he wants to see the processing metrics for each file in the data set:
    1. how many records were processed
    2. how many records were discarded due to schema/data validation errors
  2. In a data lake, various processes are responsible for dumping data from a variety of sources. The quality of the data produced varies based on where the data is coming from. It is important for the user to identify which sources are producing low-quality data by tracing back to them. The user can then apply additional pre-processing to such sources or simply quarantine them.

[Metadata propagation]

  1. A developer wants to create a pipeline that reads from a dataset, applies some transformations, and propagates metadata from its input to its output. For example, if a field in the input data is tagged as “PII”, the corresponding field in the output data should also be tagged “PII”. However, if the pipeline anonymizes that field, it should not be tagged as “PII” in the output, but rather as “anonymized”. 
  2. A developer wants to create a pipeline that reads from a dataset, applies some transformation, and propagates some attributes of the source to its output. For example, he might want the output to be tagged with the file size of the input file.
  3. Organizations typically maintain one data lake into which different departments pump data. When analyzing data in the data lake, a data scientist needs additional information. For example, a field named 'resource' can have different meanings depending on where it originated. For the Admin Operations department, a resource can simply represent a hardware unit, whereas for the Human Resources department, a resource can represent employee information. Therefore, it would be best to annotate the sink dataset with the origin upon ingestion. 
  4. As a part of the ingestion process, data can be tagged with owner information. Data scientists can use such owner information to assign a weight to the dataset. 

[Integrations]

  1. An enterprise has a business metadata system (for example, Atlas or Collibra) and would like to synchronize the CDAP metadata with that system:
    1. Periodic batch import/export
    2. Batch export of all meta data that has changed since last export
    3. Tight integration through exposing all metadata changes via a message bus
    4. Query external system from pipeline
    5. Publish to external system from pipeline

Required Platform Capabilities

[Trace back] Ability to trace a single record back to its origin

  • we would need to know
    • What run of which pipeline produced this record?
    • How was each field of the file (the output of that run) computed?
    • When traced back to the source, what input file was it in?
  • this can be accomplished by
    • adding the input file name and the id and run id of the pipeline to each record
    • computing field-level lineage for each run of a pipeline
    • possibly repeating this step for the pipeline that produced the input for this pipeline; etc.
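
A minimal Python sketch of the first step above, assuming records are plain dictionaries. The field names `_input_file`, `_pipeline_id`, and `_run_id` and both function names are hypothetical and only illustrate the idea; they are not CDAP APIs.

```python
from typing import Any, Dict

# Hypothetical names for the provenance fields attached to each record.
PROVENANCE_FIELDS = ("_input_file", "_pipeline_id", "_run_id")


def annotate_record(record: Dict[str, Any], input_file: str,
                    pipeline_id: str, run_id: str) -> Dict[str, Any]:
    """Return a copy of the record with provenance fields attached."""
    annotated = dict(record)  # leave the caller's record untouched
    annotated["_input_file"] = input_file
    annotated["_pipeline_id"] = pipeline_id
    annotated["_run_id"] = run_id
    return annotated


def trace_back(record: Dict[str, Any]) -> Dict[str, str]:
    """Extract the provenance fields needed to locate the record's origin."""
    return {key: record[key] for key in PROVENANCE_FIELDS}
```

Repeating `trace_back` on the records of the identified input file would walk one more step towards the original source, provided the upstream pipeline annotated its output the same way.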
[Trace forward]
  • we need to know for this dataset:
    • what files were received during this time frame?
    • what pipelines processed any of these files, and what were their outputs?
    • possibly transitively the same for pipelines that processed the outputs
  • this can be accomplished by
    • storing a lineage graph from dataset to dataset
    • annotating each file in a dataset with the run id of the pipeline that processed it
    • a tool recursively/transitively finds all files produced from the affected files
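
The transitive search described above could look roughly like the following Python sketch. It assumes the file-level lineage is already available as a mapping from each input file to the files produced from it; the mapping and the function name `trace_forward` are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def trace_forward(derived_from: Dict[str, List[str]],
                  affected: Iterable[str]) -> Set[str]:
    """Return every file transitively derived from the affected files.

    derived_from maps an input file to the files a pipeline run produced
    from it. The affected files themselves are not part of the result.
    """
    seen: Set[str] = set(affected)
    queue = deque(seen)
    result: Set[str] = set()
    while queue:
        current = queue.popleft()
        for output in derived_from.get(current, []):
            if output not in seen:  # guards against cycles and repeats
                seen.add(output)
                result.add(output)
                queue.append(output)
    return result
```

The returned set is the list of files that would need to be reprocessed once the corrected input is available.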
[Meta Data Provenance]
  • we need to know
    • what meta data was associated with this field when it was ingested originally
    • what changes were made to the meta data afterwards, and who made them?
  • this can be accomplished by
    • storing a change log for all meta data in a retrievable way (more than just logging it)
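
As an illustration of a retrievable change log, here is a small Python sketch of an append-only log for the tags of one entity. The class and field names are hypothetical, and timestamps are assumed to be plain integers such as epoch seconds.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class MetadataChange:
    timestamp: int  # assumed to be epoch seconds
    user: str
    action: str     # "add" or "remove"
    tag: str


class TagChangeLog:
    """Append-only change log for the tags of one entity."""

    def __init__(self) -> None:
        self._changes: List[MetadataChange] = []

    def record(self, timestamp: int, user: str, action: str, tag: str) -> None:
        if action not in ("add", "remove"):
            raise ValueError(f"unknown action: {action}")
        self._changes.append(MetadataChange(timestamp, user, action, tag))

    def history(self, tag: str) -> List[MetadataChange]:
        """All changes that touched the given tag, oldest first."""
        return sorted((c for c in self._changes if c.tag == tag),
                      key=lambda c: c.timestamp)

    def tags_at(self, timestamp: int) -> Set[str]:
        """Replay the log to reconstruct the tags as of the given time."""
        tags: Set[str] = set()
        for change in sorted(self._changes, key=lambda c: c.timestamp):
            if change.timestamp > timestamp:
                break
            if change.action == "add":
                tags.add(change.tag)
            else:
                tags.discard(change.tag)
        return tags
```

With such a log, `history("PII")` answers who added or removed the tag and when, and `tags_at(t)` shows whether the tag was present at ingestion time.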
[Discovery]
  • we need to be able to 
    • tag and annotate fields of a dataset’s schema and make them searchable
    • run complex queries, such as “credit card transactions AND timestamp:(UTC OR normalized)”
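
The example query can be sketched in Python over an in-memory catalog. This is only an illustration under stated assumptions: `DatasetMetadata`, `search`, and the lower-case tag values are hypothetical, and a real implementation would use a search index rather than a linear scan.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DatasetMetadata:
    name: str
    tags: Set[str] = field(default_factory=set)
    # Tags attached to individual fields of the dataset's schema.
    field_tags: Dict[str, Set[str]] = field(default_factory=dict)


def search(catalog: List[DatasetMetadata], dataset_tag: str,
           field_name: str, any_field_tags: Set[str]) -> List[str]:
    """Names of datasets that carry dataset_tag AND whose field
    field_name carries at least one of any_field_tags."""
    matches = []
    for dataset in catalog:
        tags_on_field = dataset.field_tags.get(field_name, set())
        if dataset_tag in dataset.tags and tags_on_field & any_field_tags:
            matches.append(dataset.name)
    return matches
```

A call such as `search(catalog, "credit card transactions", "timestamp", {"utc", "normalized"})` corresponds to the query above.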
[Metrics as Metadata]
  • we need to 
    • store metrics as meta data during processing
    • retrieve these metrics as meta data during discovery 
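
A rough Python sketch of both halves: collecting per-file processing metrics, then using them during discovery to flag low-quality files. The metric keys `records.processed` and `records.discarded` and the function names are assumptions made for this example.

```python
from typing import Any, Callable, Dict, Iterable, List


def collect_metrics(records: Iterable[Any],
                    is_valid: Callable[[Any], bool]) -> Dict[str, int]:
    """Count processed and discarded records for one file."""
    processed = 0
    discarded = 0
    for record in records:
        processed += 1
        if not is_valid(record):
            discarded += 1
    return {"records.processed": processed, "records.discarded": discarded}


def low_quality_files(metadata: Dict[str, Dict[str, int]],
                      max_discard_ratio: float) -> List[str]:
    """Files whose share of discarded records exceeds the threshold."""
    flagged = []
    for path, metrics in sorted(metadata.items()):
        processed = metrics["records.processed"]
        # Skip empty files to avoid dividing by zero.
        if processed and metrics["records.discarded"] / processed > max_discard_ratio:
            flagged.append(path)
    return flagged
```

Here `metadata` stands in for the metadata store, keyed by file path; the flagged files point back to the sources that may need extra pre-processing or quarantine.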
[Metadata Propagation]
  • we need 
    • programmatic APIs (for plugins) to access and publish meta data
      • metadata should only be written if the pipeline is successful
    • ways to configure a pipeline:
      • what meta data to retrieve from context/arguments/external 
      • how to publish that meta data, and for what entities
  • minimum requirement is to have plugin APIs such that a custom plugin can do it
    • better: a Python/JavaScript action plugin to avoid compile/package/deploy
    • even better: a DSL or a set of configurations/directives for Hydrator to avoid coding 
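
To make the propagation rule concrete, here is a Python sketch that copies field tags from input to output, replaces “PII” with “anonymized” on anonymized fields, and publishes only after a successful run. All names are hypothetical; `run_pipeline` and `publish` stand in for whatever the plugin API would eventually provide.

```python
from typing import Callable, Dict, Set

FieldTags = Dict[str, Set[str]]


def propagate_field_tags(input_tags: FieldTags,
                         anonymized_fields: Set[str]) -> FieldTags:
    """Copy field tags from input to output, replacing "PII" with
    "anonymized" on fields that the pipeline anonymizes."""
    output_tags: FieldTags = {}
    for name, tags in input_tags.items():
        copied = set(tags)  # do not mutate the input metadata
        if name in anonymized_fields and "PII" in copied:
            copied.discard("PII")
            copied.add("anonymized")
        output_tags[name] = copied
    return output_tags


def run_and_publish(run_pipeline: Callable[[], bool],
                    input_tags: FieldTags,
                    anonymized_fields: Set[str],
                    publish: Callable[[FieldTags], None]) -> bool:
    """Publish propagated metadata only if the pipeline run succeeds."""
    succeeded = run_pipeline()
    if succeeded:
        publish(propagate_field_tags(input_tags, anonymized_fields))
    return succeeded
```

Deferring `publish` until the run has succeeded keeps the sink's metadata from describing output that was never written.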

Requirements

Store
  • associate meta data with a file
  • associate metadata with a field of a dataset’s schema
  • retrieve meta data for non-CDAP entities
  • search meta data for non-CDAP entities
  • retrieve the change history for all meta data of an entity (and its sub-entities)
Lineage
  • File to file lineage
  • Field lineage
    • collect per plugin/transform/directive
    • present as graph or similar navigable UI
Pipeline
  • propagate meta data from source to sink
  • map input files to output files 1:1
  • conditional processing based on meta data
  • explicitly set metadata for an entity
  • associate processing metrics as meta data for the sink
  • define meta data based on condition
Integrations
  • query meta data for an entity from an external meta data system
  • publish meta data to an external meta data system
  • all meta data operations via message bus
  • batch import/export of meta data (only changes)
  • authorization for meta data through Ranger/Sentry/external auth provider
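
The "only changes" batch export can be sketched in Python as a watermark over a change log. This assumes each change is a dictionary with an integer `timestamp` key; the function name and the record layout are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Change = Dict[str, Any]


def export_changes(change_log: List[Change],
                   last_exported: int) -> Tuple[List[Change], int]:
    """Return the changes newer than the watermark, oldest first,
    together with the watermark to use for the next export."""
    batch = sorted((c for c in change_log if c["timestamp"] > last_exported),
                   key=lambda c: c["timestamp"])
    new_watermark = batch[-1]["timestamp"] if batch else last_exported
    return batch, new_watermark
```

The caller would persist the returned watermark after the external system has acknowledged the batch, so that a failed export is retried rather than skipped.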

Current Roadmap

5.0:
5.1:
  • File/Partition/custom entity meta data
  • Integration with external meta data systems
5.2:
  • Metadata provenance
  • Operational metadata 
  • Catalog of all data by metadata