Terminology:

...

  1. PreviewService will be started when CDAP is started.
  2. Data in the preview space will be stored in the "data/preview" directory.
  3. PreviewService will be responsible for running the PreviewHttpHandler.
  4. The user sends a request to the preview endpoint with the appropriate configurations. The configurations include configs understood by CDAP (such as ProgramType and ProgramName) and configs understood by the app (such as the stage to run and the input data, in case of a Hydrator pipeline). See the request sketch after this list.
  5. Since the preview request is namespaced, we will check whether the namespace exists in the real space. If it does, a namespace with the same name will be created in the preview space.
  6. CDAP will generate a unique preview id for this request, which is returned to the user.
  7. The preview id returned to the user can then be used to query the status of the preview and the data generated during the preview run, and also to stop the preview if it runs for too long.
  8. Upon receiving the preview request, the Hydrator app will be configured based on the application configurations. For example, for a single-stage preview configuration, we can add a Worker to the app which will run the transform.
  9. The CDAP platform will determine which program in the application needs to be executed, based on the preview configurations provided to CDAP.
  10. Metrics: Metrics data will be stored in a separate preview space.
  11. Logging: How will logging work in the SDK, since we use a file appender for the real space?
  12. Preview data will be deleted periodically.
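
To make steps 4-7 concrete, below is a minimal client-side sketch of the preview flow. The base URL, the endpoint paths (/previews, /status, /data, /stop), and the JSON field names are illustrative assumptions only; the actual PreviewHttpHandler API is not specified here.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;

    /**
     * Illustrative client for the preview flow in steps 4-7. Endpoint paths,
     * request body, and response shape are assumptions for discussion only.
     */
    public class PreviewClientSketch {

      private static final String BASE = "http://localhost:11015/v3/namespaces/default";

      public static void main(String[] args) throws Exception {
        // Step 4: submit a preview request. The body mixes configs understood by
        // CDAP (program type/name) with configs understood by the app
        // (e.g. which Hydrator stage to run and how much input data to use).
        String requestBody = "{"
            + "\"programType\":\"WORKER\","
            + "\"programName\":\"SingleStagePreview\","
            + "\"appConfig\":{\"stageToRun\":\"ParseCSV\",\"numRecords\":100}"
            + "}";
        // Step 6: assume the response body carries the generated preview id.
        String previewId = post(BASE + "/previews", requestBody);

        // Step 7: use the preview id to poll status, fetch generated data, or stop the run.
        String status = get(BASE + "/previews/" + previewId + "/status");
        String data   = get(BASE + "/previews/" + previewId + "/data");
        post(BASE + "/previews/" + previewId + "/stop", "");
        System.out.println(status + " / " + data);
      }

      private static String post(String url, String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
          out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return readAll(conn);
      }

      private static String get(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        return readAll(conn);
      }

      private static String readAll(HttpURLConnection conn) throws Exception {
        try (Scanner s = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
          return s.hasNext() ? s.next() : "";
        }
      }
    }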

Preview in Distributed:

  1. Preview service will run in a separate container. The container will be started when the master is started and will keep running.
  2. Data generated by the preview system will be stored locally in the container. We can use the LevelDB database, similar to the standalone mode.
  3. PreviewHttpHandler will be exposed through the preview container.
  4. Logging for preview: The preview container will use the local log appender, similar to the SDK.
  5. Metrics for preview: Metrics data will be stored locally in LevelDB.
  6. Authorization: We store user privileges in Sentry. A user is allowed to execute a program if they have EXECUTE privileges on it. This is currently managed by AuthorizationEnforcer. We can inject the same instance into the preview container so that reading from and writing to user datasets will be controlled by the privileges in Sentry (see the authorization sketch after this list).
  7. Impersonation: We will use impersonation in preview. Currently, we store impersonation configurations in the namespace meta store, and NamespaceQueryAdmin is responsible for reading those configs. The preview container will need access to an instance of the query admin that queries the actual HBase table.
  8. Instances of the preview container can be increased for scalability. However, since the preview data is local to each container, requests for preview data will need to be routed to the container that handled the preview request (see the routing sketch after this list).
    1. One approach is to store the mapping from preview id to container id in HBase. These mappings can be cached in the container.
    2. Another approach is to keep pushing the preview data to HDFS once the preview is complete, so that it can be served from multiple containers.
  9. Preview data will be deleted periodically.
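
A minimal sketch of how the EXECUTE check from item 6 could look inside the preview container. The Enforcer interface below is a hypothetical stand-in for CDAP's AuthorizationEnforcer; the real class and its method signature are not reproduced here.

    import java.security.Principal;

    /** Sketch of the authorization check in the preview container (item 6). */
    public class PreviewAuthorizationSketch {

      /** Hypothetical stand-in for the enforcer backed by privileges stored in Sentry. */
      interface Enforcer {
        /** Throws if the principal lacks the given action (e.g. EXECUTE) on the entity. */
        void enforce(String entityId, Principal principal, String action) throws SecurityException;
      }

      private final Enforcer enforcer;

      PreviewAuthorizationSketch(Enforcer enforcer) {
        // The same enforcer instance used by the rest of the platform would be
        // injected into the preview container, so preview runs see the same
        // Sentry-backed privileges as real runs.
        this.enforcer = enforcer;
      }

      void startPreview(String programId, Principal user) {
        // A preview run is still an execution of the program, so the user must
        // hold EXECUTE on it; dataset reads and writes would be checked the same way.
        enforcer.enforce(programId, user, "EXECUTE");
        // ... configure and launch the preview run ...
      }
    }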
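
A minimal sketch of approach 8.1: resolving the container that owns a preview from a preview-id-to-container-id mapping, with a local cache in front of the HBase lookup. The hbaseLookup function and the class names are hypothetical.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.function.Function;

    /**
     * Sketch of approach 8.1: route a preview-data request to the container that
     * ran the preview. The HBase lookup is represented by a plain function here;
     * the table name and schema are left unspecified.
     */
    public class PreviewRouterSketch {

      private final Function<String, String> hbaseLookup;          // previewId -> containerId, backed by HBase
      private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

      PreviewRouterSketch(Function<String, String> hbaseLookup) {
        this.hbaseLookup = hbaseLookup;
      }

      /** Returns the container that owns the given preview, consulting the local cache first. */
      String containerFor(String previewId) {
        // The mapping is written once when the preview starts, so it is safe to cache.
        return cache.computeIfAbsent(previewId, hbaseLookup);
      }

      /** Serve locally if this container owns the preview, otherwise forward the request. */
      String handleDataRequest(String previewId, String thisContainerId) {
        String owner = containerFor(previewId);
        if (owner.equals(thisContainerId)) {
          return "serve locally from LevelDB";
        }
        return "forward request to container " + owner;
      }
    }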

...

  1. If we read from the real dataset, how should it affect the lineage/tracking?
  2. Preferences need to be read from the actual dataset framework.
  3. How should preview work for Action plugins?
  4. How will logs for the preview be handled in the SDK?
  5. How will preview work for a realtime data pipeline?
  6. Since datasets created in the preview space are not visible to the user, will there be a use case to explore them?
  7. Does deletion happen only for the preview data, or for the user datasets created in the preview space as well?