Tracker Data Dictionary

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

 A data dictionary will be a way for users to define and describe columns that apply across all datasets in a namespace and allow users to enforce a common naming convention, type, and indicate if the column contains PII data. Anyone creating new datasets or browsing datasets would then be able to see and leverage this information in Tracker. 

Goals

After the completion of this project users will have a single point of governance for their data fields across namespace.

Use Cases

  • A team would like to make sure that any new datasets they create use the same naming conventions for field names. For example, in all datasets, the accountId is always in the same format (not accountID or account_id), and always contains the same type of data (always a nullable string, not an int or long). Using the data dictionary, the team sets the name, format, and definition of the column, and enables compliance checking. Now, anyone browsing through the datasets available in the namespace through Tracker can see the definitions for all the columns. If the user finds a dataset which is out of compliance, he/she can take action to bring it into compliance.
  • A company has specific fields in their datasets that are sensitive such as SSN or full name. They would like these columns to be tracked across all datasets so that the users of the system know to treat it with a higher level of secrecy. By adding these columns to the data dictionary, they are able to identify these columns across any dataset and make sure they are keeping the data safe. As new datasets are created with the same columns, the data dictionary is automatically applied and the columns are marked as PII. If desired, someone would be able to write a script to automatically add tags to datasets containing PII columns and audit them to make sure only specific people have access to them.

User Stories 

    • As a user of Tracker, I would like to define and enforce a common data dictionary across all my datasets.
    • As a user of Tracker, I would like to see when a specific field in a dataset is not in compliance with my data dictionary.
    • As a user of Tracker, I would like to view the reasons why my specific field is not in compliance with my data dictionary.
    • As a user of Tracker, I would like to mark specific fields as PII or not.
    • As a user of Tracker, I would like to be able to easily add existing fields to the Data Dictionary.

Design

Dictionary

  • The data dictionary will be stored in a new custom dataset specific to a namespace and backed by a Table.
  • The name will be _dataDictionary
  • The schema of the dictionary table will be as follows:
    • rowKey - the column name, all lowercase
    • columnName - the column name with the case preserved
    • columnType - the type of the column
    • isNullable - is the column nullable or not
    • isPII - is the column PII or not
    • description - the description of the column
    • datasets - a list of datasets containing the column
    • numberUsing - will default to 1, and will be increased if other extensions start using this data

Config Options

  • As part of this, we will also need to introduce the notion of configuration preferences for Tracker in order to store the state of compliance checks for this instance.
  • We will create a new config dataset which will be a simple key-value store to keep track of this information. The goal would be to store additional config values in there as needed by future tracker features.
  • We will create a new config api handler to manage the configuration options.
  • Config keys will be in the form of <feature>.<option>. So for the compliance checks for the dictionary, the config key would be "dictionary.compliance" with a value of "true" or "false"

New Programmatic APIs

New REST APIs

Dictionary

  • We will need the following additional endpoints to support the data dictionary which will be added in a new handler class
    • Method
      Endpoint
      Description
      Params
      Sample Response
      GET/v1/dictionaryReturns the entire data dictionary for the namespacenone

      On success, returns a 200.

       

      { "results" : [] }

       

       

      {
        "results" : [
          {
            "columnName" : "testColumn1",
            "columnType" : "string",
            "isNullable" : true,
            "isPII" : false,
            "description" : "something something something"
          },
          {
            "columnName" : "testColumn2",
            "columnType" : "long",
            "isNullable" : false,
            "isPII" : true,
            "description" : "else else else"
          }
        ]
      }
      POST/v1/dictionaryReturns the data dictionary related to the specified schemapayload - a schema containing a list of columns

      On success, returns a 200.

       

      { "results" : [] }

       

       

      {
        "results" : [
          {
            "columnName" : "testColumn1",
            "columnType" : "string",
            "isNullable" : true,
            "isPII" : false,
            "description" : "something something something"
          },
          {
            "columnName" : "testColumn2",
            "columnType" : "long",
            "isNullable" : false,
            "isPII" : true,
            "description" : "else else else"
          }
        ]
      }
      POST/v1/dictionary/validateValidates the given schema against the current data dictionarypayload - A json string containing the schema to validate
      { "status" : "OK", "results" : [] }

       

       

      {
        "status" : "CONFLICT",
        "results" : [
          {
            "
            "columnName" : "nameOfColumnFromSchema",
            "expectedName" : "nameOfColumnFromDict",
            "columnType" : "string",
            "expectedType" : "string",
            "isNullable" : true,
            "expectedNullable" : true,
            "reason" : "The column case did not match the data dictionary."
          },
          {
            "columnName" : "nameOfColumnFromSchema",
            "reason" : "The column does not exist in the data dictionary."
          }
        ]
      }
      POST/v1/dictionary/{column-name}Add a column to the data dictionary

      column-name - the name of the column to add (Case Sensitive) payload - a JSON string in the format:

       

      {
        "columnType" : "String",
        "isNullable" : true,
        "isPII : false,
        "description" : "this is a description of the column"
      }
      On success, returns a 200.
      PUT/v1/dictionary/{column-name}Update a row in the data dictionary

      column-name - the name of the column to add (Case Sensitive)

      payload - a JSON string in the format:

       

      {
        "columnType" : "String",
        "isNullable" : true,
        "isPII : false,
        "description" : "this is a description of the column"
      }
      On success, returns a 200. If the field is not found, returns a 404.
      DELETE/v1/dictionary/{column-name}Delete a column from the data dictionarycolumn-name - the name of the column to delete (Case Sensitive)On success, returns a 200. If the field is not found, returns a 404. If the numberUsing > 1, return 409 conflict.

Configuration

  • Method
    Endpoint
    Description
    Params
    Sample Response
    GET/v1/configReturns the entire tracker config as a key value mapnone
    {
      "config-key-1" : "config-value-1",
      "config-key-2" : "config-value-2"
    }

     

    If no values are set, an empty map is returned.

     

    {}
    GET/v1/config/{config-key}

    /v1/config/{config-key}?strict="true" 

    Returns the value for the given config key.

    If strict is "true"

    This will return a single configuration value with key strictly matching with config-key. This will be a faster operation for single key-value lookup

    If strict is "false" or not provided
    This will perform a row scan of the table, so partial values are also accepted.

     

    config-key the key for the config to return

    Query parameter "strict" which can be true or false 
    [{"config-key" : "config-value-1"}]

     

    If multiple results are found:

     

    [
      {"config-key-1" : "config-value-1"},
      {"config-key-2" : "config-value-3"}
    ]

     

     

    Returns 404 if no config key was found.

     

    []
    POST / PUT/v1/config/{config-key}Sets the value of the specified config key

    config-key the key for the config to set payload the value for that config

    { "value" : "configValue" }
    Returns 200 if the key was successfully set or updated Returns 400 and an error message if the key or value is invalid for some reason.
    DELETE/v1/config/{config-key}Deleted the value of the specified config keyconfig-key the key for the config value to deleteReturns 200 if the key was deleted Returns 404 if the key was not found


UI Impact or Changes

Tracker will have a tab for data dictionary where users will be able to see and interact with the dictionary for that namespace.

Security Impact 

 

Impact on Infrastructure Outages 

 

Test Scenarios

Test IDTest DescriptionExpected Results
1Add new "column" to data dictionary200
2Add duplicate "column" to data dictionary400
3Delete an existing column from dictionary200
4Delete a non existing column from dictionary400
5Update column properties in dictionary with valid new properties200
6Update column properties in dictionary with invalid new properties400

Releases

Related Work

 

Future work