Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
Preview lets users run a pipeline on a sample of their actual data and inspect the input and output records at each stage, which is useful for debugging during pipeline development. It is currently available only in the CDAP sandbox, not in distributed mode. Often a pipeline cannot be meaningfully tested in the sandbox because the real data is unreachable there due to policy, permissions, firewalls, and similar restrictions. In these situations, preview in distributed mode would be highly useful.
Goals
To enable users to run pipeline preview in CDAP distributed mode.
User Stories
- As a pipeline developer, I want to be able to preview a pipeline in CDAP distributed mode
- As a pipeline developer, I do not want preview runs to write any data to the pipeline sinks
- As a pipeline developer, I want a preview run to finish within a minute
- As a pipeline developer, I want to be able to examine logs for a preview run
- As a pipeline developer, I want to be able to examine metrics for a preview run
- As a cluster admin, I want to be able to cap the amount of cluster resources previews will take
- As a cluster admin, I want preview runs to automatically clean up any data they generate
- As a cluster admin, I want preview runs to put negligible load on external systems
Design
At a high level, there is a PreviewManager that is in charge of starting preview runs and exposing tracer data for a run. The PreviewManager is called by the HTTP Handler to implement the various REST endpoints.
The PreviewManager stores state about which preview runs exist and tracer data in the PreviewStore. Metric and log data are stored in the same place all other metrics and logs are stored. PreviewManager uses the PreviewRunner to actually execute a preview run. This architecture is the same whether CDAP is running in sandbox or distributed mode. What differs in each mode is the specific implementation of the PreviewRunner and the PreviewStore.
Preview Store
Prior to 6.0.0, PreviewStore was an interface only for reading and writing tracer data for preview runs. Preview run state was stored in memory in the PreviewManager. In 6.0.0, the store will be enhanced to also hold information about the preview runs instead of keeping it in the PreviewManager. This provides a cleaner separation of duties and allows running multiple instances of the preview HTTP handlers and managers for scalability and redundancy.
Code Block |
---|
public interface PreviewStore {
  // Tracer data: per-stage values captured during a preview run.
  void putTracer(ApplicationId applicationId, String tracerName, String propertyName, Object value);
  Map<String, List<JsonElement>> getTracer(ApplicationId applicationId, String tracerName);
  void removeTracers(ApplicationId applicationId);
  // Run state: held in memory by the PreviewManager before 6.0.0.
  void addRun(ProgramRunId runId);
  List<ProgramRunId> listRuns();
  void setRunStatus(ProgramRunId runId, PreviewStatus status);
  PreviewStatus getRunStatus(ProgramRunId runId);
  // Needed to look up logs for the run; not exposed over REST.
  RunRecordMeta getRun(ProgramRunId runId);
} |
RunRecords are not exposed by the REST endpoints, but are required to get log information for the preview run.
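To make the run-tracking half of the interface concrete, here is a minimal in-memory sketch. It is not the CDAP implementation: `InMemoryRunStore` and `SketchStatus` are hypothetical names, run ids are plain strings instead of `ProgramRunId`, and the assumption that a new run starts in `WAITING` is made only for illustration. A distributed implementation would back the same operations with shared storage so that any handler instance can answer status queries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for PreviewStatus; only a state name is modeled.
enum SketchStatus { WAITING, RUNNING, COMPLETED, FAILED }

// Sketch of the run-state operations of PreviewStore, keyed by String run ids
// so the example compiles without CDAP on the classpath.
class InMemoryRunStore {
  private final Map<String, SketchStatus> runs = new ConcurrentHashMap<>();

  // Mirrors addRun; starting in WAITING is an assumption of this sketch.
  void addRun(String runId) {
    runs.putIfAbsent(runId, SketchStatus.WAITING);
  }

  // Mirrors listRuns; ids are sorted only to make the result deterministic.
  List<String> listRuns() {
    List<String> ids = new ArrayList<>(runs.keySet());
    Collections.sort(ids);
    return ids;
  }

  // Mirrors setRunStatus; rejects runs that were never added.
  void setRunStatus(String runId, SketchStatus status) {
    if (runs.replace(runId, status) == null) {
      throw new IllegalArgumentException("Unknown preview run: " + runId);
    }
  }

  // Mirrors getRunStatus; returns null for an unknown run.
  SketchStatus getRunStatus(String runId) {
    return runs.get(runId);
  }
}
```

Because the manager only talks to the store, swapping this class for one backed by a shared table would not change the callers.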
PreviewRunner
The PreviewRunner works similarly in both sandbox and distributed modes: distributed preview uses the same structure as sandbox preview. Each preview run creates its own directory where all data (metrics, logs, tracers) is stored. Dataset admin operations on existing datasets are no-ops, and admin operations on non-existent datasets occur privately in that preview space. This is done primarily in the DefaultPreviewManager, by creating a separate Guice injector for each preview run. Each injector creates the relevant classes, which are configured to write to an isolated preview space.
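The dataset admin rule above can be sketched as a small wrapper. This is an illustration only: `PreviewDatasetAdmin` and its methods are hypothetical names, datasets are represented by their names alone, and the real isolation is achieved through the per-run injector rather than a class like this.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: admin operations during a preview run never touch
// datasets that already exist, and only affect the private preview space.
class PreviewDatasetAdmin {
  private final Set<String> realDatasets;                      // exist outside the preview
  private final Set<String> previewDatasets = new HashSet<>(); // private to this run

  PreviewDatasetAdmin(Set<String> realDatasets) {
    this.realDatasets = Collections.unmodifiableSet(new HashSet<>(realDatasets));
  }

  // Returns true only if a dataset was created in the preview space.
  boolean create(String name) {
    if (realDatasets.contains(name)) {
      return false; // no-op on an existing dataset
    }
    return previewDatasets.add(name);
  }

  // Returns true only if a preview-private dataset was dropped.
  boolean drop(String name) {
    if (realDatasets.contains(name)) {
      return false; // no-op on an existing dataset
    }
    return previewDatasets.remove(name);
  }

  // A dataset is visible to the run if it is real or preview-private.
  boolean exists(String name) {
    return realDatasets.contains(name) || previewDatasets.contains(name);
  }

  // True if the dataset lives only in this run's preview space.
  boolean isPreviewOnly(String name) {
    return previewDatasets.contains(name);
  }
}
```

Dropping an existing dataset through this wrapper leaves it in place, which is the property that keeps a preview run from altering real data definitions.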
The design is broken up into two approaches, which differ in scalability. In approach 1, previews are run in a single system service instance. This is conceptually very similar to the sandbox runner, which creates its own lifecycle services and runs the program in memory; it amounts to taking the sandbox preview manager and running it in a container. The only difference in distributed mode is that it talks to distributed versions of the system services. With this type of implementation, the number of concurrent preview runs is limited by the size of the container. In addition, if the container dies, all preview state is lost on container restart. In approach 2, key data is persisted to shared storage, which allows spreading preview across multiple instances and recovering state after container death.
Approach 1
In the first approach, the PreviewHttpHandler and PreviewManager are moved to a CDAP system service, running in its own container. The actual mechanics of preview runs are largely the same as in sandbox mode. Each run writes to its own local TMS and has its own local entity, app meta, log, and metric stores written to local disk. It instantiates its own AppLifecycleService to deploy the app and its own ProgramLifecycleService to run the program using an InMemoryProgramRunner instead of a DistributedProgramRunner. The preview service keeps a configurable number of preview runs in memory, evicting the oldest run when more space is needed. In this approach, the number of concurrent preview runs is limited by the resources of the PreviewManager.
Later, the PreviewRunner may actually spawn another container to execute the program run.
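The "configurable number of preview runs in memory, evicting the oldest" behavior of approach 1 could look roughly like the following. This is a sketch, not the service's code: `PreviewRunCache` and `maxRuns` are hypothetical names, and the state kept per run is left as a type parameter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded holder for preview run state, oldest run evicted first.
class PreviewRunCache<V> {
  private final Map<String, V> runs;

  PreviewRunCache(final int maxRuns) {
    // accessOrder=false keeps insertion order, so the eldest entry is the oldest run.
    this.runs = new LinkedHashMap<String, V>(16, 0.75f, false) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        // Called after each insert; returning true drops the oldest run.
        return size() > maxRuns;
      }
    };
  }

  synchronized void put(String runId, V state) {
    runs.put(runId, state);
  }

  // Returns null if the run was never added or has been evicted.
  synchronized V get(String runId) {
    return runs.get(runId);
  }

  synchronized int size() {
    return runs.size();
  }
}
```

With `maxRuns` set to 2, adding a third run drops the first one, so a client asking for its data afterwards gets nothing back. A real service would also need to delete that run's local directory on eviction, which this sketch omits.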
Scaling and Redundancy
The sandbox PreviewManager is not scalable because it stores all of its state in memory and/or on local disk. In this approach, all data is local to the container and is lost if the container is killed or dies. Since data is local, only a single preview service instance can be running at any given time. Once the PreviewStore writes to a distributed store, CDAP can run multiple instances of the preview HTTP handler and manager to offer better scalability and reliability.
Approach 2
In the second approach, the preview run, data tracer, log, and metric stores are moved from local stores to shared persistent storage. The preview program run is still executed locally with a preview-run-specific AppLifecycleService, ProgramLifecycleService, TMS, entity store, and app meta store. However, log, metric, and data tracer data is stored in a shared store instead of locally. In addition, the preview service tracks preview runs in shared storage instead of in memory. If a service instance dies, any runs in progress in that container will be lost and eventually marked as failed by a janitor process. However, preview runs can now be distributed across multiple instances and served from any instance. This allows preview runs to scale horizontally.
In terms of implementation, approach 1 is a stepping stone to approach 2.
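The janitor mentioned for approach 2 is not specified further here, so the following is only one way it might work. It assumes each run records which service instance owns it and that the janitor is handed the set of instances currently alive; `SharedRunTable`, `ownerInstance`, and the string statuses are all hypothetical, and the map stands in for shared storage.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the shared run-tracking table of approach 2.
class SharedRunTable {
  static final class Run {
    final String ownerInstance; // service instance executing the run (assumed field)
    final String status;

    Run(String ownerInstance, String status) {
      this.ownerInstance = ownerInstance;
      this.status = status;
    }
  }

  private final Map<String, Run> runs = new ConcurrentHashMap<>();

  void start(String runId, String ownerInstance) {
    runs.put(runId, new Run(ownerInstance, "RUNNING"));
  }

  void complete(String runId) {
    runs.computeIfPresent(runId, (id, run) -> new Run(run.ownerInstance, "COMPLETED"));
  }

  String status(String runId) {
    Run run = runs.get(runId);
    return run == null ? null : run.status;
  }

  // Janitor pass: fail every RUNNING run whose owner is no longer alive.
  // Returns the ids that were failed, sorted for deterministic output.
  List<String> failOrphanedRuns(Set<String> liveInstances) {
    List<String> failed = new ArrayList<>();
    for (String runId : runs.keySet()) {
      runs.computeIfPresent(runId, (id, run) -> {
        if ("RUNNING".equals(run.status) && !liveInstances.contains(run.ownerInstance)) {
          failed.add(id);
          return new Run(run.ownerInstance, "FAILED");
        }
        return run; // finished runs and runs on live instances are untouched
      });
    }
    Collections.sort(failed);
    return failed;
  }
}
```

Completed runs stay readable after their owner dies, which matches the goal of serving preview data from any instance.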
API changes
New Programmatic APIs
None
Deprecated Programmatic APIs
None
New REST APIs
None
Deprecated REST API
None
CLI Impact or Changes
None
UI Impact or Changes
None
Security Impact
What's the impact on Authorization and how does the design take care of this aspect
Impact on Infrastructure Outages
Preview cleanup needs to be able to survive the death of the preview service instead of being handled completely in memory.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
1 | Preview a MapReduce pipeline in distributed CDAP | |
2 | Preview a Spark pipeline in distributed CDAP | |