Problem: Currently the user must supply the real-space datasets that the preview pipeline will use. Otherwise, we read from the dataset in preview space by default, which creates an empty dataset in preview space and reads nothing.
Goal: Dynamically determine which datasets are real, and make sure no writes happen in real space during a preview run.
Approach:
Overview: We will remove the realDatasets set, which was originally designed to be provided by the user. PreviewDatasetFramework will then only be responsible for creating datasets in preview space. When getting a dataset, we will get system datasets from preview space and user datasets from real space, unless the arguments specify which space to read from.
- Provide NoopOutputFormat for the sink
- If a dataset name exists in real space, it is impossible to tell whether the pipeline should use it. So we want a NoopOutputFormat, which writes nothing to the dataset.
- To make sure no writes happen in real space, we will not let the sink write anything to its datasets.
- With a NoopOutputFormat, the sink writes nothing, while all of the transform logic is still preserved and tested by the preview run.
- Expose whether the pipeline is running in preview mode to the source plugins
- Since our pipeline accepts runtime arguments for plugin properties, sometimes we will not know the name of the dataset until runtime. Letting the plugin know that the pipeline is running in preview mode therefore helps it decide where to read and create the dataset.
- Some sources create datasets at runtime for writing, e.g. FileBatchSource has a timeTable that records the last read time. We need to make sure we do not create these datasets in real space while running in preview mode.
- For datasets that require reading, we check whether the dataset exists in real space. If so, we read from it; if not, we create one in preview space. For datasets that require writing, we create them ONLY in preview space.
- Improve PreviewStore performance
- Since the records are StructuredRecords, they are easy to serialize.
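
To illustrate the NoopOutputFormat item above, here is a minimal sketch of the idea in plain Java. It does not use the real Hadoop OutputFormat or CDAP sink APIs. RecordWriter, NoopRecordWriter and PreviewSinkSketch are hypothetical names introduced only for this example. The sketch shows that the transform still runs on every record while the writer discards the output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative only: every name here is hypothetical, not a CDAP or Hadoop type.
final class PreviewSinkSketch {

  /** Stand-in for the writer a sink would normally obtain from its output format. */
  interface RecordWriter<T> {
    void write(T record);
  }

  /** Discards every record, so a preview run never touches the sink's dataset. */
  static final class NoopRecordWriter<T> implements RecordWriter<T> {
    private long discarded;

    @Override
    public void write(T record) {
      discarded++; // count only; nothing is persisted
    }

    long discardedCount() {
      return discarded;
    }
  }

  /** Applies the transform to each input, then hands the result to the writer. */
  static <I, O> List<O> run(List<I> inputs, Function<I, O> transform, RecordWriter<O> writer) {
    List<O> transformed = new ArrayList<>();
    for (I input : inputs) {
      O output = transform.apply(input); // transform logic still executes in preview
      transformed.add(output);           // kept so the preview run can inspect it
      writer.write(output);              // with a no-op writer, nothing reaches a dataset
    }
    return transformed;
  }

  private PreviewSinkSketch() { }
}
```

In this sketch the preview run would pass a NoopRecordWriter where a normal run passes the real writer, so the transform code path is identical in both modes.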
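
The read/write rules for datasets in the source plugin item above can be sketched as follows. This is an assumption-laden illustration, not the PreviewDatasetFramework implementation. PreviewDatasetResolver, Space and Access are hypothetical names, and dataset existence is modelled with in-memory sets rather than real namespaces.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: every name here is hypothetical, not part of CDAP.
final class PreviewDatasetResolver {

  enum Space { REAL, PREVIEW }

  enum Access { READ, WRITE }

  // Names that already exist in real space; this sketch never adds to it.
  private final Set<String> realDatasets;
  // Names created in preview space during the preview run.
  private final Set<String> previewDatasets = new HashSet<>();

  PreviewDatasetResolver(Set<String> realDatasets) {
    this.realDatasets = new HashSet<>(realDatasets);
  }

  /** Decides which space a dataset request is served from during a preview run. */
  Space resolve(String name, Access access) {
    // Reads use the real dataset when one exists.
    if (access == Access.READ && realDatasets.contains(name)) {
      return Space.REAL;
    }
    // Writes, and reads of datasets missing from real space,
    // are served by a dataset created in preview space only.
    previewDatasets.add(name);
    return Space.PREVIEW;
  }

  boolean existsInPreview(String name) {
    return previewDatasets.contains(name);
  }

  boolean existsInReal(String name) {
    return realDatasets.contains(name);
  }
}
```

For example, a source such as FileBatchSource asking to write its timeTable would resolve to preview space, and no dataset of that name would be created in real space.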