Metadata and Lineage User Stories
While deciding on the user stories for metadata and lineage, we need to consider the following types of users:
- Program developer: User responsible for writing CDAP program code, such as MapReduce or Spark programs.
- Plugin developer: User responsible for writing plugins for the CDAP data pipeline application.
- Pipeline developer: User responsible for creating pipelines by connecting different types of plugins through the UI. A pipeline developer can also supply transform code as part of plugins such as the JavaScript or Python transforms.
- Pipeline runner: User responsible for running the pipelines. A pipeline runner controls the behavior of a pipeline by supplying runtime arguments.
- Admin: User responsible for administrative operations, such as manually assigning metadata to a field of a dataset.
- Data Governance officer: Non-technical user, mainly concerned with business metadata and lineage.
- Data scientist: User responsible for data analytics and recommendations.
The following sections describe the requirements and the corresponding user stories from the perspective of each of these user types.
- Field level metadata tagging (a hedged API sketch follows this list):
  - As a developer of a MapReduce/Spark program, I want a programmatic way to look at the fields in a dataset and assign metadata/tags to them.
  - As a developer of a MapReduce/Spark program, I want a programmatic way to read the metadata associated with the fields in the input dataset.
  - As a plugin developer, I want a programmatic way to assign metadata/tags to the fields belonging to the output schema.
  - As a plugin developer, I want a programmatic way to read the metadata/tags associated with the fields of the input schema.
  - As a developer of a CDAP data pipeline, I want the ability to assign metadata/tags to the fields of the StructuredRecord. For example, if I am creating a pipeline which reads from a database table (say UserProfile), then when I populate the schema in the UI, I want to assign the tag PII=true to certain fields such as phone number, social security number, etc.
  - (Is there a user story where a CDAP data pipeline developer needs to read the metadata associated with a field while developing a pipeline?)
  - As a developer of a CDAP data pipeline, if I add a plugin such as the JavaScript transform, where I provide my own transform method, I should be able to read the tags associated with the fields of the input schema so that I can do tag-based processing in my JavaScript transform method.
  - As a developer of a CDAP data pipeline, if I add a plugin such as the JavaScript transform, where I provide my own transform method, I should be able to assign tags to the fields belonging to the output schema.
  - As a runner of a CDAP data pipeline, I want the ability to provide additional metadata/tags through runtime arguments. For example, in a test environment the pipeline runner might not want to obfuscate the PII fields, so they should be able to provide the runtime argument "userprofile.field.phonenumber.tags.PII=false".
  - (Is there a user story where a CDAP data pipeline runner needs to read the metadata associated with a field while running the pipeline?)
  - As an Admin of the CDAP platform, I should be able to look at which tags are associated with a particular field of a given dataset.
  - As an Admin of the CDAP platform, I should be able to assign new tags/metadata to a particular field of a given dataset.
  - As a Data Governance officer, I should be able to list the fields which are marked as "PII" in a given dataset.
  - As a Data scientist, I want to get the list of datasets in which a field is marked with a given tag, for example all datasets whose phone number field is marked with "anonymized=true".
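None of the tagging capabilities above exist yet; the sketch below only illustrates what a programmatic field-tagging API might look like from a program or plugin, plus how the runtime-argument override from the pipeline-runner story could be parsed. The `FieldMetadata` interface and every name in it are hypothetical, not an existing CDAP API.

```java
// Minimal sketch of a hypothetical field-tagging API; no such interface exists in CDAP today.
import java.util.Map;

/** Illustrative accessor a program or plugin context could expose for field-level metadata. */
interface FieldMetadata {
  /** Returns the tags currently associated with a field of a dataset. */
  Map<String, String> getTags(String dataset, String field);

  /** Associates a tag with a field of a dataset. */
  void addTag(String dataset, String field, String key, String value);
}

class FieldTaggingExample {
  /** Marks sensitive UserProfile fields as PII, as in the pipeline-developer story above. */
  static void tagSensitiveFields(FieldMetadata metadata) {
    metadata.addTag("UserProfile", "phoneNumber", "PII", "true");
    metadata.addTag("UserProfile", "socialSecurityNumber", "PII", "true");
  }

  /**
   * Applies overrides supplied as runtime arguments, using the key format from the
   * pipeline-runner story: dataset.field.fieldName.tags.tagKey=value.
   */
  static void applyRuntimeOverrides(FieldMetadata metadata, Map<String, String> runtimeArgs) {
    for (Map.Entry<String, String> arg : runtimeArgs.entrySet()) {
      String[] parts = arg.getKey().split("\\.");
      // Expected shape: <dataset>.field.<fieldName>.tags.<tagKey>
      if (parts.length == 5 && "field".equals(parts[1]) && "tags".equals(parts[3])) {
        metadata.addTag(parts[0], parts[2], parts[4], arg.getValue());
      }
    }
  }
}
```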
- Finer granularity (file/partition/table in a database) metadata (a hedged API sketch follows this list):
  - As a developer of a CDAP program such as MapReduce/Spark, when I read a fileset dataset, I should be able to read the metadata associated with each individual file in the dataset.
  - As a developer of a CDAP program such as MapReduce/Spark, when I write to a fileset dataset, I should be able to assign tags/metadata to each individual file in the dataset.
  - As a developer of a CDAP program such as MapReduce/Spark, when I read a partitioned fileset dataset, I should be able to read the metadata associated with each partition in the dataset.
  - As a developer of a CDAP program such as MapReduce/Spark, when I write to a partitioned fileset dataset, I should be able to assign tags/metadata to each partition in the dataset.
  - As a developer of a CDAP Action plugin, I should be able to read the tags/metadata, such as a data quality score, associated with each individual file in a dataset.
  - As a developer of a CDAP Action plugin, I should be able to read the tags/metadata associated with each partition of a partitioned fileset dataset.
  - As a developer of a CDAP Action plugin, I should be able to assign tags/metadata, such as a data quality score, to each individual file in a fileset dataset.
  - As a developer of a CDAP Action plugin, I should be able to assign tags/metadata to each partition of a partitioned fileset dataset.
  - As an Admin of the CDAP platform, I should be able to list the tags/metadata associated with an individual file in a dataset.
  - As an Admin of the CDAP platform, I should be able to assign/override tags/metadata associated with an individual file in a dataset.
  - As an Admin of the CDAP platform, I should be able to list the tags/metadata associated with an individual partition in a partitioned fileset dataset.
  - As an Admin of the CDAP platform, I should be able to assign/override tags/metadata associated with an individual partition in a partitioned fileset dataset.
  - As a Data Governance officer, I should be able to search for files on HDFS given a specific tag/metadata; for example, "Owner=HR" gives me all files owned by the HR department.
  - As a Data Governance officer, I should be able to search for all directories on HDFS given a specific tag/metadata; for example, "CreationDate=12/30/2017" gives me all directories created on the specified date.
  - As a Data scientist, for compliance reasons I only want to use files which are tagged with a certain tag, for example "SecurityCode=green", for my analysis.
  - (What is the role of the CDAP pipeline developer and the CDAP pipeline runner in this particular section? Can they use this capability somehow?)
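As with the previous section, the API below is purely illustrative: a sketch of how an Action plugin might read and assign per-file or per-partition tags, assuming a hypothetical `EntityTags` handle. (CDAP's PartitionedFileSet already supports some key/value metadata per partition, so the partition half may map onto existing functionality; per-file tags on a plain fileset would be new.)

```java
// Hypothetical sketch of per-file / per-partition tagging; all names are illustrative.
import java.util.Map;

/** Illustrative handle for tagging finer-grained entities inside a dataset. */
interface EntityTags {
  Map<String, String> getFileTags(String dataset, String filePath);
  void addFileTag(String dataset, String filePath, String key, String value);
  Map<String, String> getPartitionTags(String dataset, String partitionKey);
  void addPartitionTag(String dataset, String partitionKey, String key, String value);
}

class QualityScoreAction {
  /** An Action-plugin-style body that records and then consumes a per-file quality score. */
  static void run(EntityTags tags) {
    String dataset = "rawEvents";
    String file = "/data/rawEvents/2017-12-30/part-00000";

    // Writer side: attach a quality score computed by an upstream check.
    tags.addFileTag(dataset, file, "qualityScore", "0.97");

    // Reader side: a downstream program processes only files above a threshold.
    double score = Double.parseDouble(
        tags.getFileTags(dataset, file).getOrDefault("qualityScore", "0"));
    if (score >= 0.9) {
      System.out.println("Processing " + file);
    }
  }
}
```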
- Store metadata along with the record (a hedged sketch follows this list):
  - As a developer of a MapReduce/Spark program, I want the ability to read the tags/metadata associated with the files/partitions/dataset in the map and reduce tasks (or Spark executor tasks) so that I can emit them as an additional field in the record.
  - As a CDAP plugin developer, in the transform method I want the ability to read the tags/metadata associated with the files/partitions/dataset so that I can emit them as an additional field of the StructuredRecord.
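A sketch of the plugin-developer story: a transform body that copies a tag of the current file/partition into every output record as an extra field. StructuredRecord and Schema are real CDAP classes; how the partitionTag value is obtained (for example, via an API like the sketches above) is the hypothetical part.

```java
// Sketch of a transform that stores source metadata alongside each record.
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.data.schema.Schema;

class TagEmittingTransform {
  // Output schema = the input "body" field plus one extra "sourceTag" string field.
  private static final Schema OUT_SCHEMA = Schema.recordOf(
      "out",
      Schema.Field.of("body", Schema.of(Schema.Type.STRING)),
      Schema.Field.of("sourceTag", Schema.of(Schema.Type.STRING)));

  /** Emits the input record plus the tag of the file/partition it came from. */
  static StructuredRecord transform(StructuredRecord input, String partitionTag) {
    return StructuredRecord.builder(OUT_SCHEMA)
        .set("body", input.get("body"))
        .set("sourceTag", partitionTag)  // hypothetically looked up via an EntityTags-style API
        .build();
  }
}
```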
- Field level lineage (a hedged API sketch follows this list):
  - As a developer of a MapReduce/Spark program, I want a programmatic way to specify the transformations that happen on the fields of the input dataset to generate the fields of the output dataset. Along with each transformation, I should be able to specify a higher-level, human-readable description, which will be helpful to non-technical users such as the Data Governance officer.
  - As a developer of a Source plugin for the CDAP data pipeline application, I want the ability to specify which fields are generated from which source.
    - Note 1: In the source plugin we can specify transformations. For example, the File source reads the content of a file and produces two fields, "offset" and "body". The Kafka source reads bytes from a Kafka topic and, depending on the format specified in the plugin configuration (CSV, bytes, ...), the transform method of the source will attempt to create the fields specified in the output schema. For lineage purposes, the "Reference Name" can be used as the name of the dataset; however, which fields should be shown as part of the dataset in the UI when displaying the lineage diagram? Should we show a generic field, say "Data" or "Record", or should it contain the fields from the output schema?
    - Note 2: There are additional properties associated with sources: the Kafka source has properties such as the broker address and topic names; the File source has additional properties such as the file path. These properties would be helpful in the lineage diagram to figure out where the source data is actually coming from, rather than only having the Reference Name there. Should these properties be collected as part of the field level lineage API itself, or should they be captured as properties of the dataset (as indicated by the Reference Name)?
  - As a developer of a Transform plugin, I want the ability to specify the transformations that happen on the fields of the input schema to generate the fields of the output schema.
  - As a developer of a Sink plugin, I want the ability to specify the transformations happening on the fields.
    - Note 1: Consider the Kafka Producer sink. The sink gets its input schema from the previous stage. However, this sink also accepts Message Configurations (Message Format and Message Key Field). The message format (CSV, JSON) is used to format the input structured record. Do we need to preserve this information as part of the Field Level Lineage API?
  - Since Action plugins do not participate in the data flow, we will not expose Field Level Lineage for them.
  - As a developer of a CDAP data pipeline, in certain cases I want the ability to specify field level operations. (Sagar: Need to think more about this user story. For example, I am using a plugin which removes from the input schema the fields that are tagged with "PII=true". How would the lineage be provided then? The same is true for plugins such as the JavaScript transform.)
  - As a developer of a CDAP data pipeline, I want the ability to provide a description of a field operation that a stage is performing.
  - As a Data Governance officer, I want the ability to look at how a field was generated within a specified time window.
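A sketch of the shape such a field level lineage API could take: each operation names its input fields and output fields, and carries the human-readable description requested above. All class and method names here are illustrative only, not a committed design.

```java
// Hypothetical field-level lineage API sketch; names are illustrative only.
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** One field-level operation: input fields -> operation -> output fields, plus a description. */
class FieldOperation {
  final String name;
  final String description;        // human-readable, for non-technical users
  final List<String> inputFields;
  final List<String> outputFields;

  FieldOperation(String name, String description,
                 List<String> inputFields, List<String> outputFields) {
    this.name = name;
    this.description = description;
    this.inputFields = inputFields;
    this.outputFields = outputFields;
  }
}

class LineageExample {
  /** Records how "name" in the output dataset is derived from the input fields. */
  static List<FieldOperation> describe() {
    return Arrays.asList(
        new FieldOperation("read", "Read the UserProfile table",
            Collections.emptyList(), Arrays.asList("firstName", "lastName")),
        new FieldOperation("concat", "Concatenate first and last name",
            Arrays.asList("firstName", "lastName"), Arrays.asList("name")),
        new FieldOperation("write", "Write to the Customers table",
            Arrays.asList("name"), Collections.emptyList()));
  }
}
```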
- Metadata provenance:
  - As a Data Governance officer, I want the ability to get all the fields in a dataset and their corresponding tags/metadata, including how each tag was assigned (i.e., the provenance of the metadata).
- Metadata propagation (a hedged sketch follows this list):
  - For the CDAP program developer, the user stories are similar to those for field level metadata tagging: they would want a way to read the metadata of a field in the source dataset and write metadata for the corresponding field in the destination dataset.
  - As a developer of a CDAP plugin, when I provide the lineage information, I want the ability to specify whether the metadata/tags of the source field should also be copied to the destination field.
  - As a developer of a CDAP plugin, while copying metadata from the source field to the destination field, I also want the ability to provide additional metadata, possibly overriding the copied metadata. For example, if a plugin anonymizes a PII field, then the field is no longer PII, but it needs to be marked with the new tag "anonymized=true".
  - As a pipeline developer/pipeline runner, I want the ability to override the copying of metadata from the source field to the destination field.
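The copy-then-override behavior described above could reduce to a simple merge, sketched below with plain maps; the surrounding plugin and pipeline plumbing is omitted and all names are illustrative.

```java
// Sketch of the copy-then-override propagation step; illustrative names only.
import java.util.HashMap;
import java.util.Map;

class PropagationExample {
  /**
   * Copies the tags of a source field to the destination field, then applies
   * plugin-supplied overrides, e.g. an anonymizing plugin clearing PII.
   */
  static Map<String, String> propagate(Map<String, String> sourceTags,
                                       Map<String, String> overrides) {
    Map<String, String> destTags = new HashMap<>(sourceTags);  // copy step
    destTags.putAll(overrides);                                // override step
    return destTags;
  }

  public static void main(String[] args) {
    Map<String, String> source = new HashMap<>();
    source.put("PII", "true");

    Map<String, String> overrides = new HashMap<>();
    overrides.put("PII", "false");         // anonymized, so no longer PII
    overrides.put("anonymized", "true");   // new tag from the story above

    System.out.println(propagate(source, overrides));  // {PII=false, anonymized=true}
  }
}
```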