Metadata Requirements and Examples

Field level metadata tagging:
1. Capabilities to tag metadata at field level for dataset schemas.
2. The capability needs to be exposed to the data pipelines and plugins as well as enterable through the UI.
3. It should be possible to add tags to the field level both automated and manual way.
  
  Example 1: Metadata based processing: Consider a CDAP data pipeline which reads the user profile dataset. The user profile dataset contains server PII fields such as social security number, phone number of the user etc. A stage in the data pipeline is responsible for the obfuscating the PII data. This stage looks at all the fields and if the field is tagged as 'PII' it obfuscates it.
  
  Example 2. Metadata based discovery: Data governance officer might want to get the list of PII fields in the user profile dataset present in the data lake. In this case he would like to get the fields social security number, phone number, and address as well.
Finer granularity (File/Partition/Table in a database) metadata:
1. Capabilities to annotate the object level metadata. (we need to define what different types of objects are and also is it possible to specify the custom objects).
  For example:
  1. How is the directory on HDFS is created?
  2. Which files are responsible for creating this partition in a dataset
2. This helps answering the compliance related questions such as how the file got created. This also gives a traceability so that If the file is bad we know whats the origin of the file.
  
  Example 1: Organization can have multiple business units such as 'HR', 'Finance' etc. SFTP is commonly used mechanism for sharing data within the business unit in an organization. Since these files are used within the single business unit only the format in which the files are stored might not be consistent in the organization. Files from the HR maybe stored as CSV, while files from Finance may be stored as rich XML format. In order to perform analysis on the file, they must be imported into the HDFS by using CDAP data pipelines. While importing, normalization is done on the files to store them in HDFS in common format.
  
  Data governance officer still need to look at the data in HDFS at file level rather than abstracted CDAP dataset level. By looking at the file he would want to know the checksum associated with the file, which business unit the file belongs to, when the file was created etc.
  
  Example 2: As an extension to the above example, multiple different services (such as CDAP, Custom Hadoop stack etc.) are responsible for pumping files in the data lake. If we tag the file with the name of the process such as 'CDAP', then the data governance officer will know which service is bringing the file in.
Store metadata along with the record: Capabilities are required to store the metadata supplied to the data pipeline as a part of record itself.

Example: Consider a data pipeline which processes files containing news viewership feeds from multiple publishers. Files are tagged with the id of the publisher. In a data lake we want to store the viewership from all publishers in a single dataset, so that we can perform analysis such as which news got highest number of views. However it would be still useful to get the information such as for a particular publisher which news is the most popular. Since the files are tagged with the publisher id, we would like to store the publisher id in the record itself for the analysis in the data lake.

Field level lineage:
Example: Consider a CDAP data pipeline which reads the user profile files from SFTP. Once the files are copied on HDFS, the files are parsed, User Name is normalized in the required format, unique Id is generated for the user, and then it is stored on HDFS in CDAP dataset.

SFTP to HDFS Copy------->File------->Parser------->Name-Normalizer------->IdGenerator------->Store

Consider a sample record which is read by File source - John,Smith,31,Santa Clara,CA

Pipeline Stage	Fields Emitted with values	Field Level Operations
File	body: John,Smith,31,Santa Clara,CA offset: <line offset>	operation: read input: file output: body, offset description: read the file to generate the body field
Parser	first_name: John last_name: Smith age: 31 city: Santa Clara state: CA	operation: parse input: body output: first_name, last_name, age, city, state description: parsed body field
Name-Normalizer	name: John Smith age: 31 city: Santa Clara state: CA	operation: concat input: first_name, last_name output: name description: concatenate first_name and last_name fields operation: drop input: first_name description: delete first_name operation: drop input: last_name description: delete last_name
IdGenerator	id: JohnSmith007 name: John Smith age: 31 city: Santa Clara state: CA	operation: create input: name output: id description: generated unique id based on the name
Store	id: JohnSmith007 name: John Smith age: 31 city: Santa Clara state: CA	No field level operations performed

Now if you want to look at the lineage of the field id which is part of CDAP dataset generated by stage Store of the above pipeline -

           |---first_name---->|
    (parse)|                  | (concat)     (create)
body------>|                   |-------->name---------->id
           |                  |   
           |---last_name----->|

Metadata provenance: According to Wiki (https://en.wikipedia.org/wiki/Provenance#Science) provenance allow us to assign metadata in sufficient details so that it allows us reproducibility of the data in the data lake.
1. This helps answering questions such as who changed what metadata. (we need both pieces who changed and what was the change)
2. Provenance information should be available through the REST api as well as through UI.
3. Technical and business metadata should be tracked.
  
  Example 1: Debugging: Consider that a data governance officer is looking at the fields of a particular dataset and associated metadata for the fields. He notices that the 'User Address' field is missing a PII tag. He looks at the other provenance related tags such as 'Owner process', 'Creation date' and can notify appropriate parties to look into further. For example if the 'Owner Process' is 'Data import team' then he can notify them. Data import team then can drill further based on the 'Creation date' and adjust the pipeline accordingly.
  
  Example 2: Metadata change based processing:
  Only trigger processing if the metadata associated with the source is changed last time since the source processed. (Example not clear yet).
Metadata propagation: By default propagate the metadata at the field level. but this can be overridden by the pipeline developer.

Example: Consider a CDAP data pipeline which reads the user profile information containing PII data for example social security number associated with the user. Based on the social security number field, pipeline fetches additional information about the user such as users current address and phone number. User information then stored in the data lake. Once the data is in the data lake, generic anonymization process can look at the PII fields given a dataset and anonymize them. Since user phone number and address (which are generated from the social security number) are also considered as sensitive personal information it is useful to propagate the PII tag to them too so that those fields will also get anonymize.
Accessing the captured metadata:
1. The captured metadata should be accessible through API, plugins, and UI.
2. This will help applying rules based on the retrieved metadata.
  
  Example 1: If the field is marked as PII, then perform obfuscation process.
  
  Example 2: In flight metadata: Consider a CDAP data pipeline which reads the User profile information. The dataset contains the firstname and lastname as a field. These fields are tagged as 'User Identifier'. Now the pipeline perform Title casing on the field such that first character is capitalised to generate the fields UpperFirstName and UpperLastName. Next stage in the pipeline concatenates the UpperFirstName and LowerFirstName fields to generate the Name field. Since the UpperFirstName, UpperLastName, and Name field are in flight and not persisted to any dataset it, the stage in the pipeline when see these fields should still be able to access the tag associated with those fields which is 'User Identifier'.
Integration with Enterprise metadata systems:
1. Consume the metadata events from the external systems such as Atlas, Colibra etc. into the CDAP.
2. Push the metadata events from CDAP to the external systems such as Atlas, Colibra.
3. What metadata is pushed to the external systems from CDAP need to be configurable.
4. Integration should also allow maintain the referential integrity into the external systems. For example if I browse for a field in the UI corresponding tags should also be fetched from the external systems (not very clear about this requirement)
5. This helps achieving the complete automated metadata framework.
Operational metadata:
1. System should be able to generate the operational metadata.
  Example: A file was generated and during the processing of which 1000 Warning were generated. This information need to be captured.

Following are just notes for Sagar:

Lower priority : FLL need to be available in the plugin so you can do conditional processing based on when the metadata is changed

Metadata need to be in sync with the Colibra and CDAP

possibility to add policies about how to rollup metadata from diff runs

Metadata propagation: By default propagate the metadata at the field level. but this can be overridden by the pipeline developer.

accumulators for the operational metadata - what about checksum?

8. In flight data? for example: change name to full name... any plugin which accesses the field should be able to access the metadata associated with it, even if the metadata is not persisted.