Preview (SDK and Distributed)

Problem:

When a developer is creating a pipeline, there is no way to verify whether the plugins are configured correctly. The developer first has to publish the pipeline and then run it once to verify that everything is working. If it is not, the pipeline has to be cloned and the entire process repeated.

Solution:

We propose a preview feature for Hydrator pipelines. This allows a developer to run a pipeline without publishing it, and to review how data is passed through each stage, which helps in configuring the pipeline correctly.

Running a pipeline in preview mode is nothing but running the actual CDAP program. This results in updates to the CDAP metadata tables (such as generating a new run record). However, preview is only meant to verify the pipeline, so we want to isolate the updates that are made while previewing it. To allow this, we will create a separate space to which data is written during preview.

Terminology:

  1. Preview space: When we run a program, we update the metadata tables, and the program might also create additional datasets. Preview of a pipeline is nothing but an actual program run, but since preview is meant for debugging, we do not want it to update the real metadata tables or create new datasets there. To keep the updates made during preview isolated, we create separate instances of these tables to be used during preview. This is referred to as the preview space.

  2. Preview Dataset Framework (PDF): The dataset framework used to access datasets in the preview space.

  3. Real Dataset Framework (RDF): The user may want to read data from actual sources while running a preview. In such cases we will use the dataset framework that can read datasets from the real space, called the RDF. Note that we will only allow READING from real datasets through the RDF; writing will always happen to datasets in the preview space through the PDF (a sketch of this routing follows below).
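
To make the PDF/RDF routing concrete, the following is a minimal sketch in Java. It uses a much-simplified, hypothetical stand-in for the DatasetFramework interface (the real CDAP interface has many more methods), and the class and method names here are illustrative only, not the actual CDAP API.

    // Hypothetical, much-simplified stand-in for CDAP's DatasetFramework.
    interface SimpleDatasetFramework {
      boolean exists(String namespace, String datasetName);
      Object getDataset(String namespace, String datasetName);        // returns a dataset handle
      void createDataset(String namespace, String datasetName, String typeName);
    }

    // Routes reads to the real space (RDF) only for datasets the user has
    // explicitly declared as real inputs; every write goes to the preview
    // space (PDF), including dataset creation.
    class PreviewRoutingDatasetFramework {
      private final SimpleDatasetFramework pdf;          // preview space
      private final SimpleDatasetFramework rdf;          // real space, READ only
      private final java.util.Set<String> realInputs;    // datasets to read from the real space

      PreviewRoutingDatasetFramework(SimpleDatasetFramework pdf, SimpleDatasetFramework rdf,
                                     java.util.Set<String> realInputs) {
        this.pdf = pdf;
        this.rdf = rdf;
        this.realInputs = realInputs;
      }

      // READ: served from the real space only for declared inputs that exist there.
      Object getForRead(String namespace, String datasetName) {
        if (realInputs.contains(datasetName) && rdf.exists(namespace, datasetName)) {
          return rdf.getDataset(namespace, datasetName);
        }
        return pdf.getDataset(namespace, datasetName);
      }

      // WRITE: always lands in the preview space; create the dataset there if needed.
      Object getForWrite(String namespace, String datasetName, String typeName) {
        if (!pdf.exists(namespace, datasetName)) {
          pdf.createDataset(namespace, datasetName, typeName);
        }
        return pdf.getDataset(namespace, datasetName);
      }
    }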

Caveats:
Even though we are attempting to isolate the preview space, we will still need some metadata from the real space (a sketch of how these lookups could be used follows this list):
  1. PreferenceStore, to look up the preferences with which the pipeline can be started.
  2. Namespace metadata, to get information about the existing namespaces in the real space, so that when a preview is attempted in a namespace that does not yet exist in the preview space, it can be created there.
  3. Secure store, for resolving macros.
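
As an illustration of how these lookups could fit into starting a preview, here is a minimal sketch; RealNamespaceQueryAdmin, RealPreferenceStore, and PreviewNamespaceAdmin are hypothetical stand-ins, not the actual CDAP interfaces.

    // Hypothetical stand-ins for the real-space lookups the preview system needs.
    interface RealNamespaceQueryAdmin {
      boolean exists(String namespace);                   // namespace metadata, real space
    }
    interface RealPreferenceStore {
      java.util.Map<String, String> getPreferences(String namespace); // preferences, real space
    }
    interface PreviewNamespaceAdmin {
      boolean exists(String namespace);
      void create(String namespace);                      // creates the namespace in the preview space
    }

    class PreviewBootstrap {
      private final RealNamespaceQueryAdmin realNamespaces;
      private final RealPreferenceStore realPreferences;
      private final PreviewNamespaceAdmin previewNamespaces;

      PreviewBootstrap(RealNamespaceQueryAdmin realNamespaces,
                       RealPreferenceStore realPreferences,
                       PreviewNamespaceAdmin previewNamespaces) {
        this.realNamespaces = realNamespaces;
        this.realPreferences = realPreferences;
        this.previewNamespaces = previewNamespaces;
      }

      // Verify the namespace against the real space, mirror it into the preview
      // space if needed, and return the preferences the preview run starts with.
      java.util.Map<String, String> prepare(String namespace) {
        if (!realNamespaces.exists(namespace)) {
          throw new IllegalArgumentException("Namespace not found in real space: " + namespace);
        }
        if (!previewNamespaces.exists(namespace)) {
          previewNamespaces.create(namespace);
        }
        return realPreferences.getPreferences(namespace);
      }
    }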

Preview in SDK:

  1. PreviewService will be started when CDAP is started.
  2. Data in the preview space will be stored in the "data/preview" directory.
  3. PreviewService will be responsible for running the PreviewHttpHandler.
  4. The user sends a request to the preview endpoint with the appropriate configurations. Note that the configurations include configs understood by CDAP (such as ProgramType and ProgramName) and configs understood by the app (such as the stage to run and the input data, in the case of a Hydrator pipeline). A sketch of the resulting request flow follows this list.
  5. Since the preview request is namespaced, we will check whether the namespace exists in the real space. If it does, a namespace with the same name will be created in the preview space.
  6. CDAP will generate a unique preview id for this request, which is returned to the user.
  7. The preview id returned to the user can then be used to query the status of the preview and the data generated during the preview run, and also to stop the preview if it is running for too long.
  8. Upon receiving the preview request, the Hydrator app will be configured based on the application configurations. For example, for a single-stage preview, we can add a Worker to the app that runs just that transform.
  9. The CDAP platform will determine which program in the application to execute, based on the preview configurations provided to CDAP.
  10. Metrics: Metrics data will be stored in the separate preview space.
  11. Logging: How logging will work in the SDK is still open, since we use a file appender for the real space (see Open Questions).
  12. Preview data will be deleted periodically.
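
From the client's perspective, the flow in steps 4 through 7 could look roughly as follows. This is a sketch only: the endpoint paths, request body, and response handling are assumptions, since the exact REST API is not fixed in this document.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Walks through steps 4-7: submit a preview request, receive a preview id,
    // poll the status, and (optionally) fetch data or stop the preview.
    public class PreviewClient {
      // Assumed standalone router address; the "previews" paths are hypothetical.
      private static final String BASE = "http://localhost:11015/v3/namespaces/default";
      private static final HttpClient CLIENT = HttpClient.newHttpClient();

      public static void main(String[] args) throws Exception {
        // Configs understood by CDAP (program type/name) plus configs understood
        // by the app (stage to run, sample input records for a Hydrator pipeline).
        String previewRequest = "{"
            + "\"program\": {\"type\": \"Workflow\", \"name\": \"DataPipelineWorkflow\"},"
            + "\"appConfig\": {\"stage\": \"ParseCSV\", \"inputData\": [\"a,1\", \"b,2\"]}"
            + "}";

        // 1. Submit the preview request; CDAP returns a unique preview id.
        HttpResponse<String> submit = CLIENT.send(
            HttpRequest.newBuilder(URI.create(BASE + "/previews"))
                .POST(HttpRequest.BodyPublishers.ofString(previewRequest))
                .build(),
            HttpResponse.BodyHandlers.ofString());
        // For this sketch, assume the response body is the bare preview id.
        String previewId = submit.body().trim();

        // 2. Poll the preview status using the returned id.
        HttpResponse<String> status = CLIENT.send(
            HttpRequest.newBuilder(URI.create(BASE + "/previews/" + previewId + "/status"))
                .GET()
                .build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println(status.body());

        // 3. The same id can be used to fetch stage data or to stop a
        // long-running preview, e.g. GET .../previews/{id}/data and
        // POST .../previews/{id}/stop.
      }
    }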

Preview in Distributed:

  1. The preview service will run in a separate container. The container will be started when the master starts and will keep running.
  2. Data generated by the preview system will be stored locally in the container. We can use a LevelDB database, similar to standalone mode.
  3. PreviewHttpHandler will be exposed through the preview container.
  4. Logging for preview: The preview container will use a local log appender, similar to the SDK.
  5. Metrics for preview: Metrics data will be stored locally in LevelDB.
  6. Authorization: We store user privileges in Sentry. A user is allowed to execute a program if the user has EXECUTE privileges on it. This is currently managed by AuthorizationEnforcer. We can inject the same instance into the preview container so that reading from and writing to user datasets is controlled by the privileges in Sentry (see the first sketch after this list).
  7. Impersonation: We will use impersonation in preview. Currently we store impersonation configurations in the namespace meta store, and NamespaceQueryAdmin is responsible for reading those configs. The preview container will need access to an instance of the query admin that queries the actual HBase table.
  8. The number of preview containers can be increased for scalability. However, since preview data is local to a container, a request for preview data will need to be routed to the container that handled the corresponding preview request.
    1. One approach is to store the mapping from preview id to container id in HBase. These mappings can be cached in each container (see the second sketch after this list).
    2. Another approach is to keep pushing the preview data to HDFS once a preview is complete, so that it can be served from multiple containers.
  9. Preview data will be deleted periodically.
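
For item 6, the check inside the preview container could look like the following sketch; SimpleAuthorizationEnforcer is a simplified, hypothetical stand-in for the actual AuthorizationEnforcer, and the entity/action representation here is illustrative only.

    // Simplified, hypothetical stand-in for AuthorizationEnforcer: enforce()
    // throws if the principal lacks the given action on the entity.
    interface SimpleAuthorizationEnforcer {
      void enforce(String entityId, String principal, String action) throws Exception;
    }

    // The same enforcer instance injected into the preview container gates the
    // preview run and dataset access on the privileges stored in Sentry.
    class PreviewAuthorizer {
      private final SimpleAuthorizationEnforcer enforcer;

      PreviewAuthorizer(SimpleAuthorizationEnforcer enforcer) {
        this.enforcer = enforcer;
      }

      // Check EXECUTE on the program before starting the preview run.
      void checkCanPreview(String programId, String user) throws Exception {
        enforcer.enforce(programId, user, "EXECUTE");
      }

      // Reads from real datasets go through the same READ privilege check.
      void checkCanRead(String datasetId, String user) throws Exception {
        enforcer.enforce(datasetId, user, "READ");
      }
    }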

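For item 8, the first approach (mapping stored in HBase, cached per container) could be sketched as follows; MappingStore is a hypothetical stand-in for the HBase-backed table. Since a preview is handled by exactly one container, an entry never changes once written, so the local cache never needs invalidation.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical stand-in for the HBase-backed previewId -> containerId table.
    interface MappingStore {
      String lookupContainer(String previewId);
    }

    // Each container caches the mapping locally to keep HBase lookups off the
    // hot path when serving preview-data requests.
    class PreviewRequestRouter {
      private final MappingStore store;
      private final String localContainerId;
      private final Map<String, String> cache = new ConcurrentHashMap<>();

      PreviewRequestRouter(MappingStore store, String localContainerId) {
        this.store = store;
        this.localContainerId = localContainerId;
      }

      // Resolve which container holds the data for this preview id.
      String resolveContainer(String previewId) {
        return cache.computeIfAbsent(previewId, store::lookupContainer);
      }

      // If the data is not local, the caller forwards the request to the
      // container returned by resolveContainer().
      boolean isLocal(String previewId) {
        return localContainerId.equals(resolveContainer(previewId));
      }
    }
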
Open Questions:

  1. If we read from a real dataset, how should it affect lineage/tracking?
  2. Preferences need to be read from the actual (real-space) dataset framework.
  3. How should preview work for Action plugins?
  4. How will logs for a preview be served in the SDK?
  5. How will preview work for a realtime data pipeline?
  6. Since datasets created in the preview space are not visible to the user, will there be a use case to explore them?
  7. Does deletion happen for the preview data only, or for the user datasets created in the preview space as well?