Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction 

Preview is a feature that allows users to run a pipeline on a sample of their actual data and inspect the input and output records at each stage. It is useful for debugging during pipeline development. Preview is currently available only in the CDAP sandbox and is unavailable in distributed mode. It is often impossible to meaningfully test a pipeline in the sandbox because policy, permissions, firewalls, and similar restrictions make the real data unreachable from it. In these situations, preview in distributed mode would be highly useful.

Goals

To enable users to run pipeline preview in CDAP distributed mode.

User Stories 

  1. As a pipeline developer, I want to be able to preview a pipeline in CDAP distributed mode
  2. As a pipeline developer, I do not want preview runs to write any data to the pipeline sinks
  3. As a pipeline developer, I want a preview run to finish within a minute
  4. As a pipeline developer, I want to be able to examine logs for a preview run
  5. As a pipeline developer, I want to be able to examine metrics for a preview run
  6. As a cluster admin, I want to be able to cap the amount of cluster resources previews will take
  7. As a cluster admin, I want preview runs to automatically clean up any data they generate
  8. As a cluster admin, I want preview runs to put negligible load on external systems 

Design

Distributed preview uses the same structure as sandbox preview. Each preview run creates its own directory where all of its data (metrics, logs, tracers) is stored. Dataset admin operations on existing datasets are no-ops, and admin operations on non-existent datasets occur privately in that preview space. This is done primarily in the DefaultPreviewManager by creating a separate Guice injector for each preview run. Each injector creates the relevant classes, which are configured to write to the isolated preview space.
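To make the isolation concrete, below is a minimal sketch of per-run injector creation. Guice and DefaultPreviewManager come from the design above; the store interface, binding name, and directory layout are hypothetical illustrations, not the actual implementation.

    import java.nio.file.Path;
    import java.nio.file.Paths;

    import com.google.inject.AbstractModule;
    import com.google.inject.Guice;
    import com.google.inject.Inject;
    import com.google.inject.Injector;
    import com.google.inject.name.Named;
    import com.google.inject.name.Names;

    public class PreviewInjectorFactory {

      // Hypothetical stand-in for a real store interface wired up per run.
      public interface MetricStore {}

      // Writes all metric data under the run-specific directory.
      public static class LocalMetricStore implements MetricStore {
        private final Path dir;

        @Inject
        public LocalMetricStore(@Named("previewDir") Path dir) {
          this.dir = dir;
        }
      }

      // One injector per preview run; every binding resolves against the
      // run's own directory, so runs cannot see each other's data.
      public Injector createInjector(String previewRunId) {
        final Path previewDir = Paths.get("/data/preview", previewRunId);
        return Guice.createInjector(new AbstractModule() {
          @Override
          protected void configure() {
            bind(Path.class).annotatedWith(Names.named("previewDir")).toInstance(previewDir);
            bind(MetricStore.class).to(LocalMetricStore.class);
          }
        });
      }
    }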

The design is broken up into several phases. The phases differ in scalability, reliability, and data persistence. In phase 1, previews are run in a single system service instance. This is conceptually very similar to taking the sandbox preview manager and running it in a container. The number of concurrent preview runs is limited by the size of the container. In addition, if the container dies, all preview state is lost. In phase 2, key data is persisted to shared storage, which allows spreading preview across multiple instances and recovering state after container death. In phase 3, preview run execution takes place outside the preview service, providing isolation between preview runs and providing automatic scaling for the runs themselves.

Approach

Phase 1

In the first phase, the PreviewHttpHandler and PreviewManager are moved to a CDAP system service, running in its own container. The actual mechanics of preview runs are largely the same as in sandbox mode.

[Figure: phase 1 architecture]

Everything inside the Preview Run box is local to that preview run. The Preview Run is executed in memory, in the same process as the Preview Service.

Each run writes to its own local TMS and has its own local entity, app meta, log, and metric stores on local disk. It instantiates its own AppLifecycleService to deploy the app and its own ProgramLifecycleService to run the program, using an InMemoryProgramRunner instead of a DistributedProgramRunner. The preview service keeps a configurable number of preview runs in memory, evicting the oldest run when more space is needed. In this phase, the number of concurrent preview runs is limited by the container size of the Preview Service. All data is local to the container; if the container is killed or dies, all preview information is lost. Since data is local, only a single preview service instance can run at any given time.
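As a rough illustration of the eviction behavior, the sketch below holds a bounded number of runs in insertion order and drops the oldest when the cap is exceeded. All class and method names here are hypothetical stand-ins, not CDAP code.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PreviewRunCache {

      // Stand-in for per-run state (injector, local stores, local TMS, ...).
      public static final class PreviewRun {
        final String runId;
        PreviewRun(String runId) { this.runId = runId; }
        void cleanup() { /* stop services, delete the run's local directory */ }
      }

      private final Map<String, PreviewRun> runs;

      public PreviewRunCache(int maxRuns) {
        // Insertion-ordered map; removeEldestEntry evicts the oldest run
        // once more than maxRuns are held in memory.
        this.runs = new LinkedHashMap<String, PreviewRun>() {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, PreviewRun> eldest) {
            if (size() > maxRuns) {
              eldest.getValue().cleanup();  // the evicted run's data is gone after this
              return true;
            }
            return false;
          }
        };
      }

      public synchronized void add(PreviewRun run) {
        runs.put(run.runId, run);
      }

      public synchronized PreviewRun get(String runId) {
        return runs.get(runId);
      }
    }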


Phase 2

In the second phase, the preview run, data tracer, log, and metric stores are moved from local stores to shared persistent storage.

[Figure: phase 2 architecture]

The preview program run is still executed locally with a preview-run-specific local AppLifecycleService, ProgramLifecycleService, TMS, entity store, and app meta store. However, log, metric, and data tracer data is stored in a shared store instead of locally. In addition, the preview service tracks preview runs in shared storage instead of in memory. If a service instance dies, any runs in progress in that container will be lost and eventually marked as failed by a janitor process. However, preview runs can now be distributed across multiple instances and served from any instance. This allows preview runs to scale horizontally.
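A minimal sketch of such a janitor is shown below, assuming a hypothetical shared run store with per-run heartbeats; the store interface, heartbeat mechanism, and sweep interval are all assumptions rather than settled design.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class PreviewRunJanitor {

      // Hypothetical interface over the shared persistent store of run records.
      public interface PreviewRunStore {
        List<String> listRunsInProgress();
        Instant lastHeartbeat(String runId);
        void markFailed(String runId, String reason);
      }

      private final PreviewRunStore store;
      private final Duration timeout;
      private final ScheduledExecutorService executor =
          Executors.newSingleThreadScheduledExecutor();

      public PreviewRunJanitor(PreviewRunStore store, Duration timeout) {
        this.store = store;
        this.timeout = timeout;
      }

      public void start() {
        executor.scheduleAtFixedRate(this::sweep, 1, 1, TimeUnit.MINUTES);
      }

      private void sweep() {
        Instant cutoff = Instant.now().minus(timeout);
        for (String runId : store.listRunsInProgress()) {
          // A run whose owning instance stopped heart-beating is presumed dead.
          if (store.lastHeartbeat(runId).isBefore(cutoff)) {
            store.markFailed(runId, "Preview service instance died");
          }
        }
      }
    }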

In terms of implementation, phase 1 is a stepping stone to phase 2.

Phase 3

In the third phase, a new implementation of the PreviewRunner is created that executes the preview run in a separate container. Launching containers through YARN is too slow for previews, which are expected to finish within a minute, but it may be possible to do this in Kubernetes. This would decouple program execution from data serving, allowing more natural scaling of the service. In this model, metrics and logs would be fetched from the remote container using a mechanism similar to the RuntimeMonitor used for remote cluster runs.

[Figure: phase 3 architecture]
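As a sketch of what launching a preview run in its own container might look like, assuming the fabric8 Kubernetes client and a hypothetical preview-runner image and namespace; this is illustrative only, not a settled design.

    import io.fabric8.kubernetes.api.model.Pod;
    import io.fabric8.kubernetes.api.model.PodBuilder;
    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;

    public class PreviewPodLauncher {

      // Launches a one-shot pod for the given preview run. Image name,
      // namespace, and args are hypothetical placeholders.
      public void launch(String previewRunId) {
        Pod pod = new PodBuilder()
            .withNewMetadata()
              .withName("preview-" + previewRunId)
              .addToLabels("cdap.preview.run", previewRunId)
            .endMetadata()
            .withNewSpec()
              .withRestartPolicy("Never")  // preview runs are one-shot
              .addNewContainer()
                .withName("preview-runner")
                .withImage("example/preview-runner:latest")
                .addToArgs("--run-id", previewRunId)
              .endContainer()
            .endSpec()
            .build();

        try (KubernetesClient client = new DefaultKubernetesClient()) {
          client.pods().inNamespace("cdap-preview").create(pod);
        }
      }
    }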


API changes

New Programmatic APIs

None

Deprecated Programmatic APIs

None

New REST APIs

None

Deprecated REST APIs

None

CLI Impact or Changes

None

UI Impact or Changes

None

Security Impact 

What's the impact on Authorization and how does the design take care of this aspect?

Impact on Infrastructure Outages 

Preview cleanup needs to survive the death of the preview service rather than being handled entirely in memory.

Test Scenarios

Test ID | Test Description | Expected Results
1 | Preview a MapReduce pipeline in distributed CDAP | 
2 | Preview a Spark pipeline in distributed CDAP | 

Releases

Release 6.0.0

Related Work

Future Work