...
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
Currently, users must specify which real-space datasets the preview pipeline will use. Otherwise, the pipeline reads from the dataset in preview space by default, which creates an empty dataset in preview space and reads nothing.
Goals
...
Dynamically determine which dataset is the real one, and make sure a preview run never writes to real space.
Approach:
...
User Stories
Users will be able to read from real-space datasets during a preview run without specifying the dataset names.
Design
We will remove the realDatasets set, which was originally designed to be provided by the user. PreviewDatasetFramework is now responsible only for creating datasets in preview space. When getting a dataset, we fetch system datasets from preview space and user datasets from real space, unless the arguments specify which space to read from.
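The lookup rule above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the actual PreviewDatasetFramework code: the class `PreviewSpaceResolver`, the `Space` enum, and the argument key `dataset.space` are all hypothetical names, since the design does not say how the arguments select a space.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: decides which space serves a dataset lookup in a preview run. */
final class PreviewSpaceResolver {
  enum Space { PREVIEW, REAL }

  // Assumed argument key; the design does not name the real one.
  static final String SPACE_ARG = "dataset.space";

  private final Set<String> systemDatasets;

  PreviewSpaceResolver(Set<String> systemDatasets) {
    this.systemDatasets = new HashSet<>(systemDatasets);
  }

  /**
   * An explicit argument wins. Otherwise system datasets come from preview
   * space and user datasets come from real space.
   * Throws IllegalArgumentException if the argument is not "preview" or "real".
   */
  Space resolve(String datasetName, Map<String, String> arguments) {
    String requested = arguments.get(SPACE_ARG);
    if (requested != null) {
      return Space.valueOf(requested.toUpperCase(Locale.ROOT));
    }
    return systemDatasets.contains(datasetName) ? Space.PREVIEW : Space.REAL;
  }
}
```

With this shape, a user dataset resolves to real space with no arguments, and the same dataset can be redirected by passing the (assumed) argument.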
Approach
The following parts are needed:
- Provide NoopOutputFormat for the sink
  - If a dataset name exists in real space, it is impossible to tell whether the pipeline should use it, so we want a NoopOutputFormat that writes nothing to the dataset.
  - To make sure writes never happen in real space, we will not let the sink write anything to the datasets.
  - With a NoopOutputFormat, the sink writes nothing, while all the logic in the transforms is still preserved and tested by the preview run.
- Expose whether the pipeline is running in preview mode to the source plugins
  - Because the pipeline accepts runtime arguments for plugin properties, we sometimes do not know the name of the dataset until runtime. Letting the plugin know that the pipeline is running in preview mode therefore helps us read and create the dataset.
  - Some sources create datasets at runtime and write to them, e.g. FileBatchSource has a timeTable that records the last read time. We need to make sure we do not create these datasets in real space while running in preview mode.
  - For datasets that require reading, we check whether the dataset exists in real space. If so, we read from it. If not, we create one in preview space. For datasets that require writing, we ONLY create them in preview space.
- Improve PreviewStore performance (this will be in 4.2.0)
  - Since the records are StructuredRecord, they are easy to serialize.
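The no-op sink idea can be illustrated with a small sketch. It assumes a simplified, hypothetical `RecordWriter` interface in place of the real output format contract, and `NoopRecordWriter` and `PreviewRun` are illustrative names rather than the actual NoopOutputFormat implementation. The point it shows is that the transform still runs on every record while the sink discards what it receives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for the writer that a sink's output format provides. */
interface RecordWriter<T> {
  void write(T record);
  void close();
}

/** Accepts every record and discards it, so nothing reaches a dataset. */
final class NoopRecordWriter<T> implements RecordWriter<T> {
  private long discarded;

  @Override
  public void write(T record) {
    discarded++; // count only; the record itself is dropped
  }

  @Override
  public void close() {
    // nothing to flush or release
  }

  long discardedCount() {
    return discarded;
  }
}

/** Hypothetical driver: runs the transform for real, then hands records to the sink writer. */
final class PreviewRun {
  static <I, O> List<O> run(List<I> input, Function<I, O> transform, RecordWriter<O> sinkWriter) {
    List<O> previewRecords = new ArrayList<>();
    for (I record : input) {
      O out = transform.apply(record); // transform logic is exercised as usual
      previewRecords.add(out);         // kept so the preview can display it
      sinkWriter.write(out);           // no-op when sinkWriter is a NoopRecordWriter
    }
    sinkWriter.close();
    return previewRecords;
  }
}
```

Swapping only the writer keeps the rest of the pipeline unchanged, which is why the preview run still tests the transform logic.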
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What is the impact on authorization, and how does the design take care of this aspect?
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of downstream component failures [YARN, HBase, etc.]) and how the design takes care of this aspect
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|