Goals
Adding a Data Wrangler will improve the overall user experience for creating schemas and make importing data easier.
Checklist
- User stories documented (Todd)
- Requirements documented (Todd)
- Requirements Reviewed
- Mockups Built
- Design Built
- Design Accepted
User Stories
- As a Hydrator user, I want a transform node, placed after a source plugin in my pipeline, that allows me to graphically build a schema to be used in that pipeline.
- As a Hydrator user I want the new transform node type to be able to operate anywhere in my pipeline after a Source is defined.
- As a Hydrator user I want the new transform node to make a best effort to determine if the first row of a file I’m importing is different from the rest so that I can quickly determine if a header row exists.
- As a Hydrator user I want the new transform node to understand common delimiters so that I can parse my data into columns and fields.
- As a Hydrator user I want the new transform node to allow configuration of column fields (name/type/reorder/include/drop/merge/split) from sources in my pipeline, using a graphical interface.
- As a Hydrator user I want the new transform node to provide easily visible statistics for data quality and a histogram for distribution. I want these to be viewable at the column level for each field in my source.
- As a Hydrator user I want the new transform node to provide a history of all steps I perform on a document.
Requirements
General
- The new tool can be instantiated as a new transform node type from inside Hydrator.
- The tool should also be accessible outside of Hydrator.
- The input for the tool should be a schema OR a JSON representation of sampled data. The new tool should not configure data sources.
- The output for the tool should be an output schema in JSON and the DSL for performing the transformations.
- State, including sample, should be preserved upon "saving" and returning from within the pipeline.
Supported Operations
Sample Data/Schema inference
- The tool should receive data for graphical presentation when it is in preview mode without explicit direction from the user.
- The tool should accept copy and paste, file upload, or an HTTP REST endpoint for sampling data.
- The default number of records/rows/documents sampled should be 1000, and should be user-definable.
- The tool should make a best effort attempt to determine delimiter.
- The tool should make a best effort attempt to determine if a header row is present.
- The tool should make a best effort attempt to determine if fields are wrapped in encapsulating delimiters, such as double quotes.
- The tool should allow the user to specify the delimiter, from the screen or from a dropdown:
  - comma
  - semicolon
  - tab
  - pipe
  - caret
  - custom (any Unicode value)
- The tool will make a best effort attempt at determining type for each column.
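The best-effort inference described above could be sketched as follows. This is only an illustration under stated assumptions, not the tool's implementation: the function names (`parse`, `guess_delimiter`, `infer_type`, `guess_header`) are hypothetical, the type set is limited to int/float/string, and the header heuristic (first row is all text while some column below it is not) is one possible reading of "the first row is different".

```python
import csv
import io

# Delimiters named in the requirements: comma, semicolon, tab, pipe, caret.
CANDIDATES = [",", ";", "\t", "|", "^"]


def parse(text, delimiter):
    """Parse text into rows, treating double quotes as the encapsulating character."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar='"')
    return [row for row in reader if row]


def guess_delimiter(text):
    """Pick the candidate that yields the same column count (> 1) on every row."""
    best, best_width = None, 1
    for candidate in CANDIDATES:
        widths = {len(row) for row in parse(text, candidate)}
        if len(widths) == 1:
            width = widths.pop()
            if width > best_width:
                best, best_width = candidate, width
    return best  # None means no confident guess; fall back to asking the user


def infer_type(values):
    """Return the narrowest of int, float, string that fits every non-empty value."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                if value != "":
                    cast(value)
            return name
        except ValueError:
            continue
    return "string"


def guess_header(rows):
    """Assume a header if row 1 is all text while some column below it is typed."""
    if len(rows) < 2:
        return False
    first_is_text = all(infer_type([cell]) == "string" for cell in rows[0])
    body_has_typed = any(infer_type(list(col)) != "string" for col in zip(*rows[1:]))
    return first_is_text and body_has_typed
```

For a sample such as `name;age;city` followed by `"Smith; Al";30;Paris`, `guess_delimiter` settles on the semicolon because the quoted field keeps the column count consistent. A file whose columns are all text gives the header heuristic nothing to go on, which is why the user-specified settings in the requirements remain necessary.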
Column Operations
- Drop Columns. Columns should be droppable from one button. There should be a global option to show dropped columns, grayed out, in the UI.
- Reorder Columns. Columns should be draggable to reorder them.
- Rename Columns. Each column name should be an editable input field.
- Type. Type should be selectable from a drop down menu.
- Split. Columns should be able to be split based on an expression or delimiting character.
- Merge. Columns of the same type should be mergeable using the following operators:
  - String
    - Concatenate with char/space
    - Deduplicate
    - Replace (if < or >)
  - Numeric
    - Sum
    - Average
    - Subtract (limited to two columns)
    - Divide (limited to two columns)
    - Deduplicate
    - Replace (if < or >)
    - Modulus (limited to two columns)
- Data Quality Score. For each column, a data quality score should be presented to indicate the percentage of nulls and the percentage of outliers.
- Histogram. A histogram should display the count of each detected value in the column (for string data) or the count of values within a numeric range (for number data). (Source: Trifacta Histograms)
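As a rough illustration of the data quality score and histogram items above, the sketch below computes both for a single column. The requirements do not define what counts as an outlier, so the 1.5 × IQR rule used here is an assumption. The function names, the choice to report percentages against all sampled rows, and the equal-width binning are likewise hypothetical.

```python
from collections import Counter
import statistics


def quality_score(values):
    """Percentage of nulls and of outliers in one column (None marks a null)."""
    total = len(values)
    if total == 0:
        return {"null_pct": 0.0, "outlier_pct": 0.0}
    present = [v for v in values if v is not None]
    nulls = total - len(present)
    outliers = 0
    all_numeric = all(isinstance(v, (int, float)) for v in present)
    if all_numeric and len(present) >= 4:
        # Assumption: an outlier lies more than 1.5 * IQR outside the quartiles.
        q1, _, q3 = statistics.quantiles(present, n=4)
        spread = 1.5 * (q3 - q1)
        outliers = sum(1 for v in present if v < q1 - spread or v > q3 + spread)
    return {"null_pct": 100 * nulls / total, "outlier_pct": 100 * outliers / total}


def histogram(values, bins=5):
    """Count per distinct value (strings) or per equal-width range (numbers)."""
    present = [v for v in values if v is not None]
    if not present:
        return {}
    if all(isinstance(v, (int, float)) for v in present):
        low, high = min(present), max(present)
        width = (high - low) / bins or 1  # avoid zero width for constant columns
        counts = Counter(min(int((v - low) / width), bins - 1) for v in present)
        return {
            (low + i * width, low + (i + 1) * width): counts.get(i, 0)
            for i in range(bins)
        }
    return dict(Counter(present))
```

Keeping both functions at the column level matches the user story that statistics be viewable per field. A real implementation would also need to decide how outliers are defined for string columns, which this sketch leaves at zero.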
Bulk column editing operations should be possible from a single view:
- Rename
- Drop
- Reorder
- Change Type
- Merge
- Deduplicate
A steps viewer should allow the user to:
- View all previous steps.
- Roll back to a previous point. A rollback will destroy all operations between the current step and the rollback point. There will be no in-process editing of steps.
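The rollback semantics above can be sketched with a simple append-only history. The class and method names are hypothetical, and steps are represented as plain description strings purely for illustration; the actual step representation is not specified in this document.

```python
class StepHistory:
    """Ordered record of wrangling steps with destructive rollback."""

    def __init__(self):
        self._steps = []

    def record(self, description):
        """Append a step and return its 1-based step number."""
        self._steps.append(description)
        return len(self._steps)

    def steps(self):
        """Return a copy of all steps so callers cannot edit history in place."""
        return list(self._steps)

    def rollback(self, step_number):
        """Keep steps 1..step_number and destroy everything after that point."""
        if not 0 <= step_number <= len(self._steps):
            raise ValueError(f"no such step: {step_number}")
        discarded = self._steps[step_number:]
        del self._steps[step_number:]
        return discarded


history = StepHistory()
history.record("drop column: id")
history.record("rename column: nm -> name")
history.record("split column: name on ' '")
history.rollback(1)  # the rename and split steps are discarded
```

Exposing only `record` and `rollback` reflects the requirement that steps are never edited in place: the only way to change an earlier step is to roll back past it and redo the work.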
Future Considerations
- Date/time support as a field type, and date/time functions
Design
Check for header workflow: