Field level metadata tagging:
- Capability to tag metadata at the field level for dataset schemas.
- The capability needs to be exposed to data pipelines and plugins, and also be available through the UI.
- It should be possible to add field-level tags both automatically and manually.
Example 1: Metadata-based processing: Consider a CDAP data pipeline that reads user profile information containing PII data, for example the social security number associated with the user. Based on the social security number field, the pipeline fetches additional information about the user, such as the user's current address and phone number. The user information is then stored in the data lake. Once the data is in the data lake, a generic anonymization process can look at the PII fields of a given dataset and anonymize them. Since the user's phone number and address (which are derived from the social security number) are also considered sensitive personal information, it is useful to tag them as PII too, so that those fields also get anonymized.
Example 2: Metadata-based discovery: A data governance officer might want to get the list of PII fields in the user profile dataset present in the data lake. In this case the officer would expect the list to include the social security number, phone number, and address fields.
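The two examples above can be illustrated with a minimal sketch. It is not the CDAP API: `FieldTagStore`, its methods, and the dataset and field names are hypothetical and only show tags attached per (dataset, field) pair, added manually or by a pipeline, and then queried for discovery.

```python
from collections import defaultdict


class FieldTagStore:
    """Hypothetical in-memory store of tags keyed by (dataset, field)."""

    def __init__(self):
        self._tags = defaultdict(set)

    def tag(self, dataset, field, *tags):
        # Add one or more tags to a single field of a dataset.
        self._tags[(dataset, field)].update(tags)

    def fields_with_tag(self, dataset, tag):
        # Discovery: list the fields of a dataset that carry the given tag.
        return sorted(
            field
            for (ds, field), tags in self._tags.items()
            if ds == dataset and tag in tags
        )


store = FieldTagStore()

# Manual tagging, e.g. entered by a user through a UI.
store.tag("user_profile", "ssn", "PII")

# Automated tagging, e.g. done by the pipeline for fields derived from the SSN.
for derived_field in ("address", "phone_number"):
    store.tag("user_profile", derived_field, "PII")

# A field that is tagged, but not as PII.
store.tag("user_profile", "signup_date", "audit")

pii_fields = store.fields_with_tag("user_profile", "PII")
print(pii_fields)  # ['address', 'phone_number', 'ssn']
```

An anonymization process (Example 1) and a data governance officer (Example 2) would both rely on the same lookup, which is why tagging the derived fields matters.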
- Finer granularity (File/Partition/Table in a database) metadata:
- Capability to annotate object-level metadata. (We need to define what the different types of objects are, and whether it is possible to specify custom objects.)
For example:
- How was the directory on HDFS created?
- Which files were responsible for creating this partition in a dataset?
- This helps answer compliance-related questions, such as how a file was created. It also provides traceability, so that if a file is bad we know its origin.
Example 1: An organization can have multiple business units, such as 'HR' and 'Finance'. SFTP is a commonly used mechanism for sharing data within a business unit. Since these files are used only within a single business unit, the format in which they are stored might not be consistent across the organization. Files from HR may be stored as CSV, while files from Finance may be stored in a rich XML format. In order to perform analysis on the files, they must be imported into HDFS using CDAP data pipelines. During the import, the files are normalized so that they are stored in HDFS in a common format.
The data governance officer still needs to look at the data in HDFS at the file level rather than at the abstracted CDAP dataset level. Looking at a file, the officer would want to know the checksum associated with the file, which business unit the file belongs to, when the file was created, etc.
Example 2: As an extension to the above example, multiple different services (such as CDAP, a custom Hadoop stack, etc.) are responsible for pumping files into the data lake. If we tag each file with the name of the process, such as 'CDAP', then the data governance officer will know which service is bringing the file in.
- Data governance officers primarily understand files; they want to know which process generated a dataset (for example CDAP), what is using it, whether there were any errors, and the checksums.
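As a rough illustration of the file-level metadata discussed in these examples, the sketch below builds a metadata record for a single file. It makes several assumptions: it reads from the local file system rather than HDFS, it uses the modification time as a stand-in for the creation time, and the function name `file_metadata` and the key names are hypothetical.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def file_metadata(path, business_unit, ingested_by):
    """Build a hypothetical file-level metadata record for one file."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    # Assumption: modification time approximates "when the file was created".
    modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return {
        "path": path,
        "checksum_sha256": sha256.hexdigest(),
        "business_unit": business_unit,  # e.g. 'HR' or 'Finance'
        "ingested_by": ingested_by,      # e.g. 'CDAP'
        "modified_at": modified.isoformat(),
    }


# Demo on a small temporary CSV file standing in for an imported HR file.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,name\n1,alice\n")

meta = file_metadata(tmp.name, business_unit="HR", ingested_by="CDAP")
os.remove(tmp.name)

print(meta["business_unit"], meta["ingested_by"], meta["checksum_sha256"][:12])
```

With records like this attached to each file, the questions from Example 2 (which service brought the file in, and what its checksum is) reduce to simple lookups.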
- Store metadata along with the record: Capabilities are required to store the metadata supplied by the source to the data pipeline as a part of the record itself.
Example: Consider a data pipeline which processes files containing news viewership feeds from multiple publishers. The files are tagged with the ID of the publisher. In the data lake we want to store the viewership from all publishers in a single dataset, so that we can perform analyses such as which news item got the highest number of views. However, it would still be useful to answer questions such as which news item is the most popular for a particular publisher. Since the files are tagged with the publisher ID, we would like to store the publisher ID in the record itself for analysis in the data lake.
- Field level lineage:
- Metadata provenance:
- This helps answer questions such as who changed what metadata. (We need both pieces: who made the change and what the change was.)
- Provenance information should be available through the REST API as well as through the UI.
- Both technical and business metadata should be tracked.
Example (metadata-change-based processing): Only trigger processing if the metadata associated with the source has changed since the last time the source was processed.
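The provenance requirement and the change-based trigger above can be sketched as an append-only change log. This is only an illustration under stated assumptions: `MetadataChange`, `ProvenanceLog`, the entity names, and the timestamps are all hypothetical, and a real implementation would persist the log and expose it through the REST API and UI.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetadataChange:
    """One provenance entry: who changed what, and when."""
    entity: str
    changed_by: str
    key: str
    old_value: object
    new_value: object
    changed_at: datetime


class ProvenanceLog:
    """Hypothetical append-only log of metadata changes."""

    def __init__(self):
        self._changes = []

    def record(self, entity, changed_by, key, old_value, new_value, changed_at):
        self._changes.append(
            MetadataChange(entity, changed_by, key, old_value, new_value, changed_at)
        )

    def changes_for(self, entity):
        return [c for c in self._changes if c.entity == entity]

    def changed_since(self, entity, since):
        # True if any metadata of the entity changed after the given time.
        return any(c.changed_at > since for c in self.changes_for(entity))


log = ProvenanceLog()
last_processed = datetime(2018, 1, 2, tzinfo=timezone.utc)  # illustrative timestamp

# 'alice' changes a business tag on source_a after the last processing run.
log.record(
    "source_a", "alice", "owner", "HR", "Finance",
    datetime(2018, 1, 3, tzinfo=timezone.utc),
)

for source in ("source_a", "source_b"):
    if log.changed_since(source, last_processed):
        print(f"{source}: metadata changed, trigger processing")
    else:
        print(f"{source}: no metadata change, skip")
```

Each entry keeps both the author and the old and new values, which covers the "who changed" and "what was the change" pieces called out above.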
- Metadata propagation: By default, metadata is propagated at the field level, but this can be overridden by the pipeline developer. For example, if a field is tagged as PII, any field generated from it needs to be tagged as PII too.
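A minimal sketch of this default-with-override behaviour follows. It assumes the pipeline can say which input fields each output field was derived from; the function `propagate_tags`, its parameters, and the field names are hypothetical.

```python
def propagate_tags(input_tags, derived_from, overrides=None):
    """Propagate field-level tags from input fields to derived output fields.

    input_tags:   input field name -> set of tags
    derived_from: output field name -> list of input fields it was derived from
    overrides:    output field name -> set of tags chosen by the pipeline
                  developer, replacing the default propagation for that field
    """
    overrides = overrides or {}
    output_tags = {}
    for out_field, sources in derived_from.items():
        if out_field in overrides:
            # The developer's explicit choice wins over the default.
            output_tags[out_field] = set(overrides[out_field])
            continue
        inherited = set()
        for src in sources:
            inherited |= input_tags.get(src, set())
        output_tags[out_field] = inherited
    return output_tags


input_tags = {"ssn": {"PII"}, "signup_date": set()}
derived_from = {
    "address": ["ssn"],
    "phone_number": ["ssn"],
    "signup_year": ["signup_date"],
    "has_ssn": ["ssn"],  # boolean flag; the developer decides it is not PII
}

result = propagate_tags(input_tags, derived_from, overrides={"has_ssn": set()})
print(result["address"], result["has_ssn"])  # {'PII'} set()
```

Treating the override as a full replacement of the inherited tags is one possible design; an alternative would let the developer add or remove individual tags.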
- Integration with Enterprise metadata systems:
- Consume metadata events from external systems such as Atlas, Collibra, etc. into CDAP.
- Push metadata events from CDAP to external systems such as Atlas and Collibra.
- Which metadata is pushed from CDAP to the external systems needs to be configurable.
- The integration should also maintain referential integrity with the external systems. For example, if I browse for a field in the UI, the corresponding tags should also be fetched from the external systems. (This requirement is not yet very clear.)
- This helps achieve a completely automated metadata framework.
- Operational metadata:
- The system should be able to generate operational metadata. For example, a file was generated, and 1000 warnings were raised while it was being processed. This information needs to be captured.
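As a small illustration of capturing such operational metadata, the sketch below counts records and warnings while processing the lines of a file. The parsing rule (each line must be an integer) and the names `process_file`, `records_read`, and `warnings` are assumptions made only for this example.

```python
def process_file(lines):
    """Process lines and return operational metadata about the run."""
    stats = {"records_read": 0, "warnings": 0}
    for line in lines:
        stats["records_read"] += 1
        try:
            int(line)  # stand-in for the real per-record processing
        except ValueError:
            # A record that cannot be processed is counted as a warning.
            stats["warnings"] += 1
    return stats


op_meta = process_file(["1", "2", "oops", "4", ""])
print(op_meta)  # {'records_read': 5, 'warnings': 2}
```

The resulting counts could then be attached to the generated file or dataset alongside its other metadata.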
- Accessing the captured metadata:
- The captured metadata should be accessible through the API, plugins, and the UI.
- This will help apply rules based on the retrieved metadata. For example, if a field is marked as PII, then run an obfuscation process on it.
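The PII rule above might look like the following sketch once field tags have been retrieved. The tag lookup is a plain dictionary here; how tags are actually fetched is not specified by this document. The truncated SHA-256 digest is only a placeholder obfuscation, not a recommendation for real anonymization, and all names are hypothetical.

```python
import hashlib


def obfuscate(value):
    # Placeholder obfuscation: truncated SHA-256 digest of the string value.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:16]


def apply_pii_rule(record, field_tags):
    """Return a copy of the record with every PII-tagged field obfuscated."""
    return {
        field: obfuscate(value) if "PII" in field_tags.get(field, set()) else value
        for field, value in record.items()
    }


# Tags as they might be retrieved from the metadata system (assumed shape).
field_tags = {"ssn": {"PII"}, "phone_number": {"PII"}, "address": {"PII"}}

record = {
    "user_id": 42,
    "ssn": "123-45-6789",
    "phone_number": "555-0100",
    "address": "1 Example Street",
}

masked = apply_pii_rule(record, field_tags)
print(masked["user_id"], masked["ssn"])
```

Because the rule is driven purely by tags, a newly tagged field is covered without changing the processing code.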
- In flight data
...