...

  1. The requirement document from the customer mentions that we need an automated/manual way to add metadata/tags at the field level.
    Sagar: My understanding is that "automated" means tags are emitted programmatically (by plugins or CDAP programs), while "manual" means assigning tags through the UI (or REST API). For example, a plugin can be implemented to mark the user address field as 'PII' in an automated fashion; however, once the data has landed in the data lake, a data governance officer may decide that another field, say phone number, should also have been marked as 'PII'. He should be able to do that through the UI, which is the manual way. We can confirm this requirement, though.

  2. The requirement document from the customer mentions that metadata provenance is required for business as well as technical metadata. Should we have provenance for all types of technical metadata? For example, if the number of WARNINGs changes across multiple pipeline runs, should we track that change too?

  3. Should metadata be stored for every run of the pipeline? What if different runs produce/propagate different metadata? How is that resolved?
    Sagar: Since there is a possible use case of manually tagging a field through the UI and REST endpoint (which won't have any runid), we may not want to store it at the run level. For aggregating metadata from multiple runs, we can provide policies (OVERRIDE, MERGE, etc.). Rohit Sinha: is this addressed in any of your design documents? If not, we can add it.

  4. When do we emit the metadata to the TMS? Does it happen when the code executes, or is the metadata cached in memory and emitted in the "destroy" method of the program? If we create a new field, then we also need to copy its metadata. Should the metadata of the newly created field be emitted to TMS instantaneously, or should it be emitted in the "destroy" method, once the field has actually been created?
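To make the aggregation policies from item 3 concrete, here is a minimal sketch (all names here are assumptions for illustration, not CDAP APIs) of how field-level tags from multiple runs could be combined under an OVERRIDE policy (latest run wins) versus a MERGE policy (union of tags across runs):

```python
from enum import Enum


class Policy(Enum):
    OVERRIDE = "override"  # the latest run's tags replace earlier ones
    MERGE = "merge"        # tags from all runs are unioned


def aggregate_field_tags(runs, policy):
    """Combine per-run field tags into a single view.

    `runs` is a list of dicts mapping field name -> set of tags,
    ordered from oldest to newest run.
    """
    result = {}
    for run_tags in runs:
        for field, tags in run_tags.items():
            if policy is Policy.OVERRIDE:
                result[field] = set(tags)
            else:  # Policy.MERGE
                result.setdefault(field, set()).update(tags)
    return result


# Run 1 tags 'address' as PII; run 2 drops that tag and tags 'phone' instead.
runs = [
    {"address": {"PII"}},
    {"address": set(), "phone": {"PII"}},
]

print(aggregate_field_tags(runs, Policy.OVERRIDE))  # 'address' loses its tag
print(aggregate_field_tags(runs, Policy.MERGE))     # 'address' keeps PII from run 1
```

Under OVERRIDE the second run's empty tag set wins for 'address', while under MERGE the 'PII' tag from the first run survives, which is exactly the difference the policy choice would expose to users.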

The following are just notes for Sagar:

...