Field Level Lineage UI (v2)

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction

The current FLL UI lets the user see the root and impact datasets, but it is not possible to easily view the incoming or outgoing operations for a single field. Also, the current UI only shows a limited number of fields from the root and impact datasets. This makes it difficult for users to visualize the lineage of large schemas or see the relationships between a single field and its cause/impact fields. 

The new UI will allow users to answer questions about one or more specific fields for impact analysis, data quality triage, data compliance, regulatory reporting, and determining the trustworthiness of data.

Goals

This doc describes the planning and design requirements for the new Field Level Lineage (FLL) UI, which will allow users to view and interact with a graphical representation of field level lineage.

Use cases 

A few examples of targeted use cases include:

  • A user will be able to determine how a field in a dataset was generated
  • A user can see how a change in schema will impact downstream datasets and their fields. 
  • A user can troubleshoot a detected quality problem in a field and find the root cause

Design

The user will navigate to this page from the Field Level Lineage button on a dataset’s search page or from the Lineage tab of a dataset detail page.

Component Design

We will use React for rendering the tables, and React Context for state management. We explored using jsPlumb or d3 to draw the edges between fields. jsPlumb is already used in pipeline studio to let users drag and drop edges between nodes, but the FLL UI does not require that level of interactivity with its edges, so we decided to use d3 instead.

Code Structure

The FLL UI is made up of the following components:

  • LineageSummary - contains the whole graph and section headers

  • FllHeader - Header for each set of cause, impact, and target tables

  • FllTable - Shows dataset information and fields. All fields are shown for the target dataset, and only related fields are shown for the cause and impact datasets. Built from the SortableStickyGrid component.

  • FllContext - Handles parsing of the backend response and management of all data passed down to components. See the State Management section for more details.

State Management

The initial structure of the state properties is shown below.

  • target: the target dataset

  • targetFields: fields of target dataset

  • nodes: list of all fields with id and name

  • links: list of edges between fields. Each edge has a source and destination field

  • causeSets: cause datasets and their fields

  • impactSets: impact datasets and their fields

  • activeField: current selected field in target dataset

  • numTables: max number of cause or impact datasets shown on a page

  • firstCause: index (starting at 1) of first cause dataset

  • firstImpact: index (starting at 1) of first impact dataset

  • firstField: index (starting at 1) of first field of target dataset

The nodes and links properties are included with the assumption that they will be useful for creating links between fields (update: we don’t actually need the nodes). The numTables and firstCause/Impact/Field properties are currently used to get information for the navigational subheadings, e.g. "Viewing 1 to 4 of 5 datasets." The max number of fields displayed for the cause, impact, and target datasets will be added later for pagination.
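As a rough illustration, the state shape and the subheading text could be sketched as follows. The property names come from the list above; the TypeScript types, the `FllField`/`FllLink`/`FllDataset` shapes, and the `viewingLabel` helper are assumptions made for this sketch and are not taken from the actual implementation.

```typescript
// Hypothetical shapes: only the FllState property names come from the design.
interface FllField { id: string; name: string; }
interface FllLink { source: string; destination: string; } // field ids (assumed)
interface FllDatasetRef { id: string; name: string; }
interface FllDataset extends FllDatasetRef { fields: FllField[]; }

interface FllState {
  target: FllDatasetRef | null;
  targetFields: FllField[];
  nodes: FllField[];
  links: FllLink[];
  causeSets: FllDataset[];
  impactSets: FllDataset[];
  activeField: FllField | null;
  numTables: number;   // max cause or impact datasets per page
  firstCause: number;  // 1-based
  firstImpact: number; // 1-based
  firstField: number;  // 1-based
}

// Illustrative initial values; numTables = 4 matches the example subheading.
const initialState: FllState = {
  target: null,
  targetFields: [],
  nodes: [],
  links: [],
  causeSets: [],
  impactSets: [],
  activeField: null,
  numTables: 4,
  firstCause: 1,
  firstImpact: 1,
  firstField: 1,
};

// Builds the navigational subheading from a 1-based first index.
function viewingLabel(first: number, pageSize: number, total: number, noun: string): string {
  if (total === 0) {
    return `Viewing 0 of 0 ${noun}`;
  }
  const last = Math.min(first + pageSize - 1, total);
  return `Viewing ${first} to ${last} of ${total} ${noun}`;
}
```

For example, `viewingLabel(initialState.firstCause, initialState.numTables, 5, "datasets")` yields the subheading quoted above.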

API Changes

We will use a new REST endpoint of the Metadata API. The REST endpoint comes with 3 query params:

  • direction - specifies the direction in which to compute the field level lineage: INCOMING, OUTGOING, or BOTH

  • start - the start time string

  • end - the end time string

This endpoint does not currently handle pagination of the response for large schemas (or a large number of cause and/or impact datasets). Therefore, pagination will initially be handled on the frontend. 
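A minimal sketch of how the UI might assemble the request URL from these three params is shown below. The endpoint path is not given in this doc, so `basePath` is a caller-supplied placeholder; the `LineageQuery` type, the `buildLineageUrl` helper, and the lowercase parameter spelling are likewise assumptions for illustration.

```typescript
type Direction = "INCOMING" | "OUTGOING" | "BOTH";

interface LineageQuery {
  direction: Direction;
  start: string; // start time string, format defined by the backend
  end: string;   // end time string, format defined by the backend
}

// basePath is a placeholder: the real endpoint path is not specified here.
function buildLineageUrl(basePath: string, query: LineageQuery): string {
  const params: Array<[string, string]> = [
    ["direction", query.direction],
    ["start", query.start],
    ["end", query.end],
  ];
  const queryString = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return `${basePath}?${queryString}`;
}
```

Keeping `direction` as a union type lets the compiler reject any value other than the three the endpoint accepts.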

Alternatives Considered

Pagination

Large schemas and/or large numbers of datasets require pagination in the UI. This can be done purely on the frontend, or on the backend by adding extra query parameters. While pagination on the backend is ideal for page load times, we will handle pagination on the frontend for now, keeping in mind that future API changes will include pagination support.
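Frontend-only pagination can be as simple as slicing the already-loaded list by a 1-based first index and a page size. The helpers below (`pageOf`, `nextFirst`, `prevFirst`) are hypothetical names used for this sketch; they assume the full response is held in memory, as described above.

```typescript
// Returns the page of items starting at the 1-based index `first`.
function pageOf<T>(items: readonly T[], first: number, pageSize: number): T[] {
  const start = Math.max(first, 1) - 1;
  return items.slice(start, start + pageSize);
}

// Advances to the next page, staying put when already on the last page.
function nextFirst(first: number, pageSize: number, total: number): number {
  const candidate = first + pageSize;
  return candidate <= total ? candidate : first;
}

// Moves back one page, never going below index 1.
function prevFirst(first: number, pageSize: number): number {
  return Math.max(first - pageSize, 1);
}
```

Because paging is driven only by an index and a size, the same calls could later be backed by query parameters once the API supports pagination.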

Batching API calls

In order to get the data needed to render the new UI, the frontend needs to get the field level lineage for each field in the target dataset. With the current UI, that requires making an API call for each field in the target dataset after retrieving all fields. Without doing any data migration, the batching of these API calls can be done by the frontend/UI or API. To minimize complexity of UI code and tech debt, these API calls will be batched on the backend behind a new REST endpoint, which will return a single object to the UI.  

Open questions

It is not yet clear what (if anything) is in scope for a redesign of the operations modal.

Caveats

The first cut of the new UI is not intended to handle very large numbers of fields from root and impact datasets, which may impact the experience of some users. This work will be addressed once the backend API can handle pagination. 

Two kinds of filtering are supported: filtering the target dataset by field name, and selecting a specific field to see only the subset of cause and impact datasets related to it. Initially all filtering will be done by the UI, but this will need to be updated once pagination is implemented in the backend API.
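Both kinds of UI-side filtering can be sketched as pure functions over the loaded data. The `LineageField`, `LineageLink`, and `LineageDataset` shapes and the helper names are assumptions for illustration, as are the case-insensitive substring match and the convention that a link's source is the cause field and its destination is the impacted field.

```typescript
interface LineageField { id: string; name: string; }
interface LineageLink { source: string; destination: string; } // field ids (assumed)
interface LineageDataset { name: string; fields: LineageField[]; }

// Case-insensitive substring match on field name (matching rule is assumed).
function filterFieldsByName(fields: readonly LineageField[], query: string): LineageField[] {
  const q = query.trim().toLowerCase();
  if (q === "") {
    return fields.slice();
  }
  return fields.filter((f) => f.name.toLowerCase().indexOf(q) !== -1);
}

// Keeps only the datasets, and within them only the fields, that are linked
// to the active field on the given side of the target.
function relatedSets(
  sets: readonly LineageDataset[],
  links: readonly LineageLink[],
  activeFieldId: string,
  side: "cause" | "impact"
): LineageDataset[] {
  const relatedIds = links
    .filter((l) => (side === "cause" ? l.destination : l.source) === activeFieldId)
    .map((l) => (side === "cause" ? l.source : l.destination));
  return sets
    .map((ds) => ({
      name: ds.name,
      fields: ds.fields.filter((f) => relatedIds.indexOf(f.id) !== -1),
    }))
    .filter((ds) => ds.fields.length > 0);
}
```

Both helpers return new arrays and leave the loaded data untouched, so clearing a filter only requires re-rendering from the original state.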

Edge case: Circular references, i.e. when the same dataset is both the cause and target (or cause and impact). The current design will show the same dataset as the target and cause (or impact) when circular references arise. 

Nested schemas are out of scope for this project because the necessary backend capabilities do not yet exist. Any nested schemas will be flattened into a single field. 

Future work

Future API changes may include adding pagination from the backend, which is not in scope for this project. This may help with loading times for: 

  • Giant schemas

  • A large number of cause and impact datasets


Once the pagination is handled by the backend, the UI can be updated easily to modify how the response is parsed and when/how the API calls are made. 

Other future UI changes that are not in scope include a redesign of the operations modal.