Figure out the real datasets for preview

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction

Currently, the user must specify which datasets in real space the preview pipeline will use. Otherwise, the preview run reads from the dataset in preview space by default, which creates an empty dataset in preview space and reads nothing.

Goals

Dynamically determine which dataset is the real dataset, and make sure that no writes happen in real space.

User Stories 

Users will be able to read from datasets in real space during a preview run without specifying the dataset names.

Design

We will remove the realDatasets set, which was originally designed to be provided by the user. PreviewDatasetFramework is now responsible only for creating datasets in preview space. When getting a dataset, we fetch system datasets from preview space and user datasets from real space, unless the arguments specify which space to read from.
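
The lookup rule above can be sketched roughly as follows. This is a minimal illustration, not the PreviewDatasetFramework implementation: the class name, the `Space` enum, and the `dataset.space` argument key are all hypothetical, and the set of system dataset names is assumed to be known up front.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names are actual CDAP APIs.
final class DatasetSpaceResolver {
  enum Space { PREVIEW, REAL }

  // Hypothetical argument key that forces a lookup into a specific space.
  static final String SPACE_ARG = "dataset.space";

  private final Set<String> systemDatasets;

  DatasetSpaceResolver(Set<String> systemDatasets) {
    this.systemDatasets = systemDatasets;
  }

  /**
   * Chooses the space for a dataset lookup: an explicit argument wins,
   * otherwise system datasets come from preview space and user datasets
   * from real space. An unrecognized argument value throws
   * IllegalArgumentException.
   */
  Space resolve(String datasetName, Map<String, String> arguments) {
    String explicit = arguments.get(SPACE_ARG);
    if (explicit != null) {
      return Space.valueOf(explicit.toUpperCase(Locale.ROOT));
    }
    return systemDatasets.contains(datasetName) ? Space.PREVIEW : Space.REAL;
  }
}
```

Keeping the explicit argument as the first check means a caller can still override the default for either kind of dataset.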

Approach

The following parts are needed:

  1. Provide NoopOutputFormat for the sink
    1. If a dataset name already exists in real space, it is impossible to tell whether the pipeline is meant to use it. So we want a NoopOutputFormat that writes nothing to the dataset.
    2. To make sure that no writes happen in real space, the sink will not write anything to the datasets.
    3. With a NoopOutputFormat, the sink performs no writes, while all the transform logic is still preserved and tested by the preview run.
  2. Expose whether the pipeline is running in preview mode to the source plugins
    1. Because the pipeline accepts runtime arguments for plugin properties, the dataset name is sometimes unknown until runtime. Letting the plugin know that the pipeline is running in preview mode therefore helps us read and create the dataset.
    2. Some sources create datasets at runtime and write to them, e.g., FileBatchSource has a timeTable that records the last read time. We need to make sure these datasets are not created in real space while running in preview mode.
    3. For datasets that require reading, we check whether the dataset exists in real space. If it does, we read from it. If not, we create one in preview space. For datasets that require writing, we create them ONLY in preview space.
  3. Improve PreviewStore performance (this will be in 4.2.0)
    1. Since the records are StructuredRecord objects, they are easy to serialize.
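
Parts 1 and 2 of the approach can be illustrated with a small self-contained sketch. It does not use the real Hadoop or CDAP types: `RecordWriter`, `NoopRecordWriter`, `Datasets`, and the two enums are hypothetical stand-ins, and dataset existence is modeled with in-memory sets, purely to show the intended behavior.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; these types only imitate the idea and are not CDAP or Hadoop APIs.
final class PreviewApproachSketch {
  enum Space { PREVIEW, REAL }
  enum Access { READ, WRITE }

  /** Minimal stand-in for the record writer a sink would obtain from its output format. */
  interface RecordWriter<T> {
    void write(T record);
  }

  /** Accepts every record and discards it, so the sink never touches a dataset. */
  static final class NoopRecordWriter<T> implements RecordWriter<T> {
    @Override
    public void write(T record) {
      // Intentionally empty: upstream transforms still run, but nothing is persisted.
    }
  }

  /** In-memory stand-in that tracks which dataset names exist in each space. */
  static final class Datasets {
    final Set<String> real = new HashSet<>();
    final Set<String> preview = new HashSet<>();

    /**
     * Reads use the real dataset when it exists. Every other case (a read of a
     * missing dataset, or any write) creates or reuses the dataset in preview space.
     */
    Space resolve(String name, Access access) {
      if (access == Access.READ && real.contains(name)) {
        return Space.REAL;
      }
      preview.add(name);
      return Space.PREVIEW;
    }
  }
}
```

In this sketch, a dataset such as FileBatchSource's timeTable, which is opened for writing, always lands in preview space, even if a dataset with the same name exists in real space.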

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path | Method | Description | Response Code | Response

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages 

System behavior (if applicable, document the impact of downstream component failures [YARN, HBase, etc.]) and how the design takes care of these aspects

Test Scenarios

Test ID | Test Description | Expected Results

Releases

Release 4.1.1

Related Work

Future work