Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
A data dictionary lets users define and describe columns that apply across all datasets in a namespace. With it, users can enforce a common naming convention and type for each column and indicate whether the column contains PII. Anyone creating or browsing datasets can then see and use this information in Tracker.
Goals
After the completion of this project, users will have a single point of governance for their data fields across a namespace.
Use Cases
- A team would like to make sure that any new datasets they create use the same naming conventions for field names. For example, in all datasets, the accountId is always in the same format (not accountID or account_id) and always contains the same type of data (always a nullable string, not an int or long). Using the data dictionary, the team sets the name, format, and definition of the column, and enables compliance checking. Now, anyone browsing the datasets available in the namespace through Tracker can see the definitions for all the columns. If a user finds a dataset that is out of compliance, they can take action to bring it into compliance.
- A company has specific fields in its datasets that are sensitive, such as SSN or full name. The company would like these columns to be tracked across all datasets so that users of the system know to treat them with a higher level of secrecy. By adding these columns to the data dictionary, the company can identify them in any dataset and make sure the data is kept safe. As new datasets are created with the same columns, the data dictionary is automatically applied and the columns are marked as PII. If desired, someone could write a script to automatically add tags to datasets containing PII columns and audit them to make sure only specific people have access to them.
User Stories
- As a user of Tracker, I would like to define and enforce a common data dictionary across all my datasets.
- As a user of Tracker, I would like to see when a specific field in a dataset is not in compliance with my data dictionary.
- As a user of Tracker, I would like to view the reasons why my specific field is not in compliance with my data dictionary.
- As a user of Tracker, I would like to mark specific fields as PII or not.
- As a user of Tracker, I would like to be able to easily add existing fields to the Data Dictionary.
Design
Dictionary
- The data dictionary will be stored in a new custom dataset specific to a namespace and backed by a Table.
- The name will be _dataDictionary.
- The schema of the dictionary table will be as follows:
  - rowKey - the column name, all lowercase
  - columnName - the column name with the case preserved
  - columnType - the type of the column
  - isNullable - whether the column is nullable
  - isPII - whether the column is PII
  - description - the description of the column
  - datasets - a list of datasets containing the column
  - numberUsing - defaults to 1, and will be increased if other extensions start using this data
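As a rough illustration of the schema above, the following Python sketch models one dictionary row and derives the row key from the column name. The class name `DictionaryEntry`, the `row_key` property, and the in-memory `table` dict are hypothetical stand-ins for illustration, not part of the design; only the field names come from the schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DictionaryEntry:
    """One row of the _dataDictionary table; field names follow the schema."""
    columnName: str                      # case preserved
    columnType: str
    isNullable: bool
    isPII: bool
    description: str = ""
    datasets: List[str] = field(default_factory=list)
    numberUsing: int = 1                 # defaults to 1 per the design

    @property
    def row_key(self) -> str:
        # The row key is the column name in all lowercase, so differently
        # cased spellings of the same column map to the same row.
        return self.columnName.lower()


entry = DictionaryEntry("accountId", "string", True, False,
                        "Unique account identifier")

# Hypothetical stand-in for the Table-backed dataset: row key -> entry.
table = {entry.row_key: entry}

# A lookup with a different case still finds the canonical spelling.
print(table["accountID".lower()].columnName)
```

Keying rows by the lowercased name is what would let a compliance check find the dictionary entry for a miscased column and report the expected spelling.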
Config Options
- As part of this, we will also need to introduce the notion of configuration preferences for Tracker in order to store the state of compliance checks for this instance.
- We will create a new config dataset, which will be a simple key-value store, to keep track of this information. The goal is to store additional config values there as needed by future Tracker features.
- We will create a new config API handler to manage the configuration options.
- Config keys will be in the form <feature>.<option>. For the dictionary compliance checks, the config key would be "dictionary.compliance" with a value of "true" or "false".
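The key convention above can be sketched in a few lines of Python. The helper names `config_key` and `compliance_enabled` are hypothetical, and treating a missing key as "disabled" is an assumption; the design only specifies the key format and the "true"/"false" values.

```python
def config_key(feature: str, option: str) -> str:
    """Build a config key of the form <feature>.<option>."""
    if not feature or not option or "." in feature:
        raise ValueError("feature and option must be non-empty; "
                         "feature must not contain '.'")
    return f"{feature}.{option}"


def compliance_enabled(config: dict) -> bool:
    # Values are stored as the strings "true" / "false".
    # Assumption: a missing key means compliance checks are disabled.
    return config.get(config_key("dictionary", "compliance"), "false") == "true"


# A plain dict stands in for the key-value config dataset.
config = {}
config[config_key("dictionary", "compliance")] = "true"
print(compliance_enabled(config))
```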
New Programmatic APIs
New REST APIs
Dictionary
- We will need the following additional endpoints to support the data dictionary. They will be added in a new handler class.
GET /v1/dictionary
- Description: Returns the entire data dictionary for the namespace.
- Params: none
- Sample response: On success, returns a 200.

With no entries:

    {
      "results": []
    }

With entries:

    {
      "results": [
        {
          "columnName": "testColumn1",
          "columnType": "string",
          "isNullable": true,
          "isPII": false,
          "description": "something something something"
        },
        {
          "columnName": "testColumn2",
          "columnType": "long",
          "isNullable": false,
          "isPII": true,
          "description": "else else else"
        }
      ]
    }

POST /v1/dictionary
- Description: Returns the data dictionary entries related to the specified schema.
- Params: payload - a schema containing a list of columns
- Sample response: On success, returns a 200.

With no matching entries:

    {
      "results": []
    }

With matching entries:

    {
      "results": [
        {
          "columnName": "testColumn1",
          "columnType": "string",
          "isNullable": true,
          "isPII": false,
          "description": "something something something"
        },
        {
          "columnName": "testColumn2",
          "columnType": "long",
          "isNullable": false,
          "isPII": true,
          "description": "else else else"
        }
      ]
    }

POST /v1/dictionary/validate
- Description: Validates the given schema against the current data dictionary.
- Params: payload - a JSON string containing the schema to validate
- Sample response:

When the schema is in compliance:

    {
      "status": "OK",
      "results": []
    }

When the schema is out of compliance:

    {
      "status": "NOT_OK",
      "results": [
        {
          "columnName": "nameOfColumnFromSchema",
          "expectedName": "nameOfColumnFromDict",
          "columnType": "string",
          "expectedType": "string",
          "isNullable": true,
          "expectedNullable": true,
          "reason": "The column case did not match the data dictionary."
        },
        {
          "columnName": "nameOfColumnFromSchema",
          "reason": "The column does not exist in the data dictionary."
        }
      ]
    }

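To make the validation response more concrete, here is a minimal Python sketch of how a schema might be checked against the dictionary. It is an illustration under stated assumptions, not the handler's implementation: the function name `validate_schema`, the in-memory dictionary keyed by lowercase column name, the check order, the exact (case-sensitive) type comparison, and the two reason strings marked as hypothetical are all assumptions. The other two reason strings and the response field names are taken from the sample response.

```python
def validate_schema(schema, dictionary):
    """Check schema columns against the data dictionary.

    schema: list of {"columnName", "columnType", "isNullable"} dicts.
    dictionary: maps lowercase column name -> dictionary entry dict.
    """
    results = []
    for col in schema:
        name = col["columnName"]
        entry = dictionary.get(name.lower())
        if entry is None:
            results.append({
                "columnName": name,
                "reason": "The column does not exist in the data dictionary.",
            })
            continue
        if name != entry["columnName"]:
            reason = "The column case did not match the data dictionary."
        elif col["columnType"] != entry["columnType"]:
            # Hypothetical wording; the design does not list this message.
            reason = "The column type did not match the data dictionary."
        elif col["isNullable"] != entry["isNullable"]:
            # Hypothetical wording; the design does not list this message.
            reason = "The column nullability did not match the data dictionary."
        else:
            continue  # column is in compliance
        results.append({
            "columnName": name,
            "expectedName": entry["columnName"],
            "columnType": col["columnType"],
            "expectedType": entry["columnType"],
            "isNullable": col["isNullable"],
            "expectedNullable": entry["isNullable"],
            "reason": reason,
        })
    return {"status": "NOT_OK" if results else "OK", "results": results}


dictionary = {
    "accountid": {"columnName": "accountId", "columnType": "string",
                  "isNullable": True},
}
report = validate_schema(
    [{"columnName": "accountID", "columnType": "string", "isNullable": True},
     {"columnName": "unknownCol", "columnType": "long", "isNullable": False}],
    dictionary,
)
print(report["status"])
```

In this sketch only the first failing check is reported per column; a real implementation might instead report every mismatch.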
POST /v1/dictionary/{column-name}
- Description: Adds a column to the data dictionary.
- Params:
  - column-name - the name of the column to add (case sensitive)
  - payload - a JSON string in the format:

    {
      "columnType": "String",
      "isNullable": true,
      "isPII": false,
      "description": "this is a description of the column"
    }

- Sample response: On success, returns a 200.

PUT /v1/dictionary/{column-name}
- Description: Updates a row in the data dictionary.
- Params:
  - column-name - the name of the column to update (case sensitive)
  - payload - a JSON string in the format:

    {
      "columnType": "String",
      "isNullable": true,
      "isPII": false,
      "description": "this is a description of the column"
    }

- Sample response: On success, returns a 200. If the field is not found, returns a 404.

DELETE /v1/dictionary/{column-name}
- Description: Deletes a column from the data dictionary.
- Params: column-name - the name of the column to delete (case sensitive)
- Sample response: On success, returns a 200. If the field is not found, returns a 404. If numberUsing > 1, returns a 409 Conflict.
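The DELETE status codes described above could be decided roughly as in the sketch below. The function `delete_column` and the in-memory `table` are hypothetical, and reading "case sensitive" as "the stored columnName must match exactly, otherwise 404" is an assumption that the design does not spell out.

```python
def delete_column(table, column_name):
    """Return the HTTP status a DELETE /v1/dictionary/{column-name} might produce."""
    row_key = column_name.lower()
    entry = table.get(row_key)
    # Assumption: "case sensitive" means the stored columnName must match exactly.
    if entry is None or entry["columnName"] != column_name:
        return 404
    # The column is still used elsewhere, so refuse to delete it.
    if entry.get("numberUsing", 1) > 1:
        return 409
    del table[row_key]
    return 200


# In-memory stand-in for the dictionary table: row key -> entry.
table = {
    "accountid": {"columnName": "accountId", "numberUsing": 1},
    "ssn": {"columnName": "ssn", "numberUsing": 2},
}
statuses = [delete_column(table, name)
            for name in ("accountID", "ssn", "accountId", "accountId")]
print(statuses)
```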
Configuration
GET /v1/config
- Description: Returns the entire Tracker config as a key-value map.
- Params: none
- Sample response:

    {
      "config-key-1": "config-value-1",
      "config-key-2": "config-value-2"
    }

If no values are set, an empty map is returned.

    {}

GET /v1/config/{config-key}
- Description: Returns the value for the given config key. This will perform a row scan of the table, so partial values are accepted.
- Params: config-key - the key for the config to return
- Sample response:

    [
      {
        "config-key": "config-value-1"
      }
    ]

If multiple results are found:

    [
      {
        "config-key-1": "config-value-1"
      },
      {
        "config-key-2": "config-value-3"
      }
    ]

Returns 404 if no config key was found.

    []

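One way to read the row scan behaviour described for GET /v1/config/{config-key} is as a prefix match over the stored keys. The sketch below assumes that interpretation; the function name `get_config`, the sorted ordering, and the sample keys other than "dictionary.compliance" are hypothetical.

```python
def get_config(store, key_prefix):
    """Return (status, body) for a lookup by full or partial config key."""
    # Assumption: "partial values are accepted" means prefix matching.
    matches = [{key: value}
               for key, value in sorted(store.items())
               if key.startswith(key_prefix)]
    return (200, matches) if matches else (404, [])


# A plain dict stands in for the key-value config dataset.
store = {
    "dictionary.compliance": "true",
    "dictionary.example-option": "x",   # hypothetical key
    "example-feature.enabled": "false",  # hypothetical key
}
print(get_config(store, "dictionary."))
print(get_config(store, "missing"))
```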
POST / PUT /v1/config/{config-key}
- Description: Sets the value of the specified config key.
- Params:
  - config-key - the key for the config to set
  - payload - the value for that config
- Sample response: Returns 200 if the key was successfully set or updated. Returns 400 and an error message if the key or value is invalid.

DELETE /v1/config/{config-key}
- Description: Deletes the value of the specified config key.
- Params: config-key - the key for the config value to delete
- Sample response: Returns 200 if the key was deleted. Returns 404 if the key was not found.
UI Impact or Changes
Tracker will have a tab for the data dictionary where users can see and interact with the dictionary for that namespace.
Security Impact
Impact on Infrastructure Outages
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
1 | Add a new column to the data dictionary | 200 |
2 | Add a duplicate column to the data dictionary | 400 |
3 | Delete an existing column from the dictionary | 200 |
4 | Delete a nonexistent column from the dictionary | 404 |
5 | Update column properties in the dictionary with valid new properties | 200 |
6 | Update column properties in the dictionary with invalid new properties | 400 |
Releases
Release X.Y.Z
Related Work