Scaling preview system on Kubernetes

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

The Preview system in CDAP has two components:

  • Preview Manager / HTTP Handler: REST API for user interaction and for retrieving preview data

  • Preview Runner: responsible for executing previews


When CDAP runs in a Kubernetes environment, it executes previews (both the Preview Manager and the Preview Runner) in a single pod. As a result, the system has several limitations:

  • Each preview run consumes an amount of memory that depends on the pipeline structure. For example, the number of records read in preview from each source is limited to 100, but a pipeline can contain many sources, in which case the total number of records read by the preview will be 100 * the number of sources in the pipeline. This can cause the preview pod to run out of memory.

  • There is a limit on the number of previews that can run concurrently. Currently this limit is set to 5.

  • The preview HTTP handler depends on an in-memory LRU cache to retrieve preview data. This cache holds the Guice injectors required to fetch the data from disk; if an injector is evicted from the cache, its preview data is no longer available.

  • A single pod requesting large amounts of CPU and memory can cause Kubernetes scheduling issues.

Goals

Since running previews in a single pod has several memory and scale limitations, we need to improve the system by running previews in dedicated runner pods.

Design

Preview Runner pods as StatefulSet

The PreviewManager pod is created by the CDAP operator. The PreviewManager in turn will be responsible for creating a Kubernetes StatefulSet object with n replicas, each of which will be used to run previews. The configuration parameter n will depend on the Kubernetes cluster size.
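To make this concrete, here is a minimal sketch of the StatefulSet that the Runner Launcher might build, using the official Kubernetes Java client. The preview-runner name, cdap namespace, image, and labels are illustrative assumptions, not the actual operator code.

```java
import io.kubernetes.client.openapi.models.V1Container;
import io.kubernetes.client.openapi.models.V1LabelSelector;
import io.kubernetes.client.openapi.models.V1ObjectMeta;
import io.kubernetes.client.openapi.models.V1PodSpec;
import io.kubernetes.client.openapi.models.V1PodTemplateSpec;
import io.kubernetes.client.openapi.models.V1StatefulSet;
import io.kubernetes.client.openapi.models.V1StatefulSetSpec;
import java.util.List;
import java.util.Map;

public class RunnerLauncher {
  // Builds a StatefulSet with n preview runner replicas; n comes from configuration
  // and depends on the cluster size. All names, labels, and the image are illustrative.
  static V1StatefulSet previewRunnerStatefulSet(int n) {
    Map<String, String> labels = Map.of("app", "preview-runner");
    return new V1StatefulSet()
        .metadata(new V1ObjectMeta().name("preview-runner").namespace("cdap"))
        .spec(new V1StatefulSetSpec()
            .replicas(n)
            .serviceName("preview-runner")          // headless Service for stable pod DNS
            .selector(new V1LabelSelector().matchLabels(labels))
            .template(new V1PodTemplateSpec()
                .metadata(new V1ObjectMeta().labels(labels))
                .spec(new V1PodSpec().containers(List.of(
                    new V1Container()
                        .name("preview-runner")
                        .image("gcr.io/example/cdap-preview-runner"))))));
    // The object is then submitted with AppsV1Api#createNamespacedStatefulSet; if that
    // call fails, subsequent preview requests are rejected, as described below.
  }
}
```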

Building Blocks

The following are the main building blocks of the system.

Preview Manager

  • Runner Launcher

    • Responsible for creating StatefulSets for preview runners.

    • Number of replicas in the StatefulSet will be determined by the cluster size and will be configurable.

    • Preview requests will be rejected if Kubernetes fails to deploy the StatefulSet object.

  • Preview Http Handler

    • HTTP handler for the public REST APIs that exist currently.

    • API to be used by Preview Runner

      • /poll for polling job requests.

    • The injector cache in the preview manager will be removed.

  • Preview Request Job Queue

    • Queue for holding the preview requests submitted by users which are waiting to run.

    • This queue will contain the generated ApplicationID along with the config JSON provided by the user.

    • Although the requests are not very memory intensive, the number of elements in the JobQueue will be limited by a configuration parameter to bound its impact on the PreviewManager’s memory (see the sketch after this list).

    • If the preview manager pod dies, the request queue can be rebuilt from disk.

  • TMS Subscriber

    • Responsible for collecting preview data and logs from preview runner pods. 

    • CDAP Global TMS will be used for data transfer between Preview Runner and Preview Manager (please see the block diagram above).

  • Data cleanup thread

    • Cleans up preview data for runs that are more than a week old.

    • Since previews store user data, for compliance purposes we will keep deleting data that is older than a configurable amount of time.
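As a rough sketch of the bounded Preview Request Job Queue described above (the class names, capacity handling, and PreviewRequest shape are assumptions, not the actual CDAP implementation):

```java
import java.util.Optional;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative bounded job queue; the real implementation is also persisted to disk
// so it can be rebuilt if the PreviewManager pod dies.
public class PreviewRequestQueue {
  // Pairs the generated ApplicationId with the user-provided config JSON.
  public record PreviewRequest(String applicationId, String configJson) {}

  private final LinkedBlockingQueue<PreviewRequest> queue;

  public PreviewRequestQueue(int capacity) {   // capacity comes from configuration
    queue = new LinkedBlockingQueue<>(capacity);
  }

  // Called on submit; false means the queue is full and the request is rejected
  // with an out-of-capacity error.
  public boolean offer(PreviewRequest request) {
    return queue.offer(request);
  }

  // Called by the /poll handler; removes and returns the next waiting request, if any.
  public Optional<PreviewRequest> poll() {
    return Optional.ofNullable(queue.poll());
  }

  // Removes a waiting request (used by /kill); true if it was present.
  public boolean remove(String applicationId) {
    return queue.removeIf(r -> r.applicationId().equals(applicationId));
  }

  // Number of requests ahead of the given application, or -1 if it is not waiting
  // (used to enrich the WAITING status).
  public int positionOf(String applicationId) {
    int position = 0;
    for (PreviewRequest r : queue) {
      if (r.applicationId().equals(applicationId)) {
        return position;
      }
      position++;
    }
    return -1;
  }
}
```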


Preview Runner

  • Preview Job Poller

    • Polls the PreviewManager for jobs and, if one is available, starts executing it (see the poll-loop sketch after this list).

  • Artifact Synchronizer

    • Periodically syncs artifacts from the GCS bucket to the persistent disk (PD) attached to the pod, as a runtime optimization.

  • Preview Runtime

    • Responsible for running the preview.

    • Since a preview executes user code, the pod will be restarted once the preview run completes. This guards against security issues, such as a user running a long-lived script on the pod.

  • Preview Data Publisher

    • Publishes data, status, metrics, and logs to the PreviewManager through the CDAP global TMS.
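A sketch of the Preview Job Poller’s loop, assuming a hypothetical /poll endpoint on the Preview Manager that returns 200 with the job body when a job is waiting and a non-200 status otherwise; the URL, port, and backoff interval are illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Illustrative poll loop; the endpoint path and response convention are assumptions.
public class PreviewJobPoller implements Runnable {
  private final HttpClient client = HttpClient.newHttpClient();
  private final URI pollUri =
      URI.create("http://preview-manager.cdap.svc.cluster.local:11015/poll");

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        HttpRequest request = HttpRequest.newBuilder(pollUri)
            .timeout(Duration.ofSeconds(10))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 200) {
          executePreview(response.body());   // body carries the acquired preview request
        } else {
          Thread.sleep(5_000);               // nothing waiting; back off before retrying
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // shut down cleanly
      } catch (Exception e) {
        // Log and keep polling; a transient manager outage should not kill the runner.
      }
    }
  }

  private void executePreview(String requestJson) {
    // Run the preview; the pod is restarted after the run completes (see above).
  }
}
```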

StatefulSet vs Deployment

Preview runner pods are essentially stateless, so Kubernetes Deployments with PersistentVolumes seem like a natural choice, with the ability to share volumes between pods. However, there are a few drawbacks to using Deployments with PersistentVolumes for preview runs:

  1. Data on the volumes would be written by multiple pods, so the access mode on the volumes would need to be ReadWriteMany. However, persistent volumes backed by GCE persistent disks do not support this mode.

  2. Deployments can be fronted by a Kubernetes Service object, which can expose preview runner services such as accepting preview requests. The Service then load-balances across the backend pods, so a request may get routed to a pod that is already handling another preview run. This can lead to a potential OOM on that preview runner pod, causing data loss.

Using a StatefulSet for the preview runner pods lets the PreviewManager orchestrate preview runs across the available pods so that no runs overlap on a single pod. The PreviewManager can control which pod a request is sent to by using the stable DNS name of the pod.
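This works because StatefulSet replicas get stable ordinal names that are resolvable through the headless Service. A minimal sketch, assuming a Service and StatefulSet both named preview-runner in a cdap namespace and an arbitrary port:

```java
// Pod N of the StatefulSet is reachable at a stable, per-pod DNS name of the form
// <statefulset-name>-<ordinal>.<service-name>.<namespace>.svc.cluster.local.
static String runnerUrl(int ordinal) {
  return String.format(
      "http://preview-runner-%d.preview-runner.cdap.svc.cluster.local:11015", ordinal);
}
```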

Get data back to Preview Manager from Preview Runner

The Preview Manager needs to collect preview data, program status, logs, and metrics from the Preview Runners. It was decided to use the CDAP Global TMS for this. The following options were considered.

CDAP Global TMS

  • Pros

    • Can take advantage of the fault-tolerance mechanism built into TMS

    • Same treatment for different pieces of data such as preview data, status, logs, and metrics

    • Data is available in a streaming fashion while the preview is running.

  • Cons

    • Additional dependency on the global TMS. We will need to load test TMS. 

Common cloud provider stores, such as Cloud SQL/GCS on GCP or S3/Redshift on AWS

  • Pros

    • Simplified design

  • Cons

    • Additional cloud provider dependency

    • Will need to handle failure scenarios

    • Shipping data via GCS/S3 may impact performance, and data won't be available in a streaming fashion.

REST API

  • Pros:

    • Might be faster to send data back.

  • Cons

    • Program states are currently transferred reliably through TMS. We could use TMS for program states and the REST API for logs, metrics, and data; however, this complicates the architecture, since different channels would be used to communicate information back.

Local TMS

  • Pros:

    • Avoid load on the Global TMS

  • Cons:

    • The TMS service would need to be exposed on a separate IP address, and IP address space is already limited.

User Experience with API

The following bootstrap process for the PreviewManager and PreviewRunner pods handles failure scenarios.

Bootstrap PreviewManager:

Build an in-memory map of <PreviewRunnerID, ApplicationID> (the PreviewRunnerApplication map) by calling an HTTP API on each preview runner.
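A sketch of this manager-side bootstrap, assuming a hypothetical /currentApplication endpoint on each runner that returns the ApplicationID it is executing (empty body when idle):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashMap;
import java.util.Map;

// Rebuilds the <PreviewRunnerID, ApplicationID> map on PreviewManager startup.
// The endpoint and the per-pod DNS scheme are illustrative assumptions.
static Map<Integer, String> bootstrapRunnerApplicationMap(int replicas) throws Exception {
  HttpClient client = HttpClient.newHttpClient();
  Map<Integer, String> map = new HashMap<>();
  for (int ordinal = 0; ordinal < replicas; ordinal++) {
    URI uri = URI.create(String.format(
        "http://preview-runner-%d.preview-runner.cdap.svc.cluster.local:11015/currentApplication",
        ordinal));
    HttpResponse<String> response = client.send(
        HttpRequest.newBuilder(uri).GET().build(), HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() == 200 && !response.body().isEmpty()) {
      map.put(ordinal, response.body());   // this runner is busy with this application
    }
  }
  return map;
}
```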

Bootstrap PreviewRunner:

When a preview runner pod starts, it checks for the existence of an application file on the persistent store attached to it. If the file exists, the preview runner emits a TMS message with ProgramRunStatus set to failed for the application and run information stored in the file. Once the TMS message is emitted, the file is deleted and the preview runner is ready to poll the preview manager for jobs.
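A sketch of this runner-side check; the application-file path and the MessagePublisher interface are assumptions for illustration:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// On startup, fail any run that was in flight when the pod died, then clean up.
class RunnerBootstrap {
  interface MessagePublisher {   // hypothetical TMS publisher
    void publishProgramRunStatus(String appAndRunInfo, String status);
  }

  void bootstrap(MessagePublisher tms) throws Exception {
    Path appFile = Path.of("/data/preview-runner/application.json"); // assumed location
    if (Files.exists(appFile)) {
      String appAndRunInfo = Files.readString(appFile);
      tms.publishProgramRunStatus(appAndRunInfo, "FAILED"); // emit ProgramRunStatus via TMS
      Files.delete(appFile);  // delete only after the message has been emitted
    }
    // The pod is now ready to poll the preview manager for jobs.
  }
}
```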

Get Status API

  • If the preview ApplicationID is in the JobQueue, return the WAITING status with extra information about how many applications are ahead of it in the queue.

  • Otherwise, query the AppMetaStore for the application run. If such a record exists, return its status.

  • If the ApplicationID exists in the PreviewRunnerApplication map, return the status INIT. This is only needed when the /status endpoint is called after a PreviewRunner has acquired the preview run but the ProgramRunStatus has not yet reached the PreviewManager, possibly because of a TMS delay.

  • Otherwise, return NOT_FOUND. This can happen for non-existent application IDs, as well as for application IDs that were in the WAITING state when the preview manager pod was restarted.
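The lookup order above as a sketch; PreviewRequestQueue is the earlier queue sketch, and AppMetaStore is a hypothetical lookup interface:

```java
import java.util.Map;

// Resolution order for the /status endpoint; all types are illustrative.
class StatusResolver {
  interface AppMetaStore { String getRunStatus(String appId); }  // null if no record

  private final PreviewRequestQueue jobQueue;
  private final AppMetaStore appMetaStore;
  private final Map<Integer, String> runnerApplicationMap;  // <PreviewRunnerID, ApplicationID>

  StatusResolver(PreviewRequestQueue queue, AppMetaStore store, Map<Integer, String> map) {
    this.jobQueue = queue;
    this.appMetaStore = store;
    this.runnerApplicationMap = map;
  }

  String getStatus(String appId) {
    int ahead = jobQueue.positionOf(appId);
    if (ahead >= 0) {
      return "WAITING, " + ahead + " request(s) ahead in the queue";
    }
    String persisted = appMetaStore.getRunStatus(appId);  // written by the TMS subscribers
    if (persisted != null) {
      return persisted;                                   // e.g. RUNNING, COMPLETED, FAILED
    }
    if (runnerApplicationMap.containsValue(appId)) {
      return "INIT";  // acquired by a runner, but the TMS status update has not arrived yet
    }
    return "NOT_FOUND";
  }
}
```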

Kill Preview

  • If the preview is in the WAITING state, we will simply remove it from the JobQueue and return the status REMOVED/OBSOLETE rather than KILLED. When the /status endpoint is next called with the same application ID, it will return NOT_FOUND.

  • If the ApplicationID exists in the PreviewRunnerApplication map, the /kill API on the corresponding preview runner pod will be called.

  • Otherwise, the ApplicationID will be checked in the preview manager store.

    • If the run is already in one of the terminal states, an HTTP CONFLICT code will be returned.

    • If the run is not in a terminal state (which should not happen), log a warning and return KILLED.
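The same decision order as a sketch, reusing the illustrative fields from the status sketch; callRunnerKill and isTerminal are hypothetical helpers:

```java
// Decision order for the /kill endpoint; illustrative only.
String killPreview(String appId) {
  // 1. Still waiting: drop it from the queue. It never ran, so report REMOVED, not KILLED;
  //    a later /status call for this application will return NOT_FOUND.
  if (jobQueue.remove(appId)) {
    return "REMOVED";
  }
  // 2. Currently owned by a runner: forward the kill to that pod.
  for (Map.Entry<Integer, String> entry : runnerApplicationMap.entrySet()) {
    if (entry.getValue().equals(appId)) {
      callRunnerKill(entry.getKey());   // hypothetical: POST /kill on that runner pod
      return "KILLED";
    }
  }
  // 3. Fall back to the persisted record in the preview manager store.
  String persisted = appMetaStore.getRunStatus(appId);
  if (persisted != null && isTerminal(persisted)) {
    return "CONFLICT";                  // surfaced to the caller as HTTP 409
  }
  // A non-terminal record with no owning runner should not happen; log a warning.
  return "KILLED";
}
```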

Submit Preview

  • User submits a preview request to Preview Http Handler.

  • If the JobQueue is already full, reject the request with an out-of-capacity error; otherwise, add the request to the JobQueue.

  • /status and /kill endpoints will be served as explained earlier.

  • A preview runner pod polls the JobQueue; if a job is available, it starts executing it.

  • The job will be removed from the queue as soon as a preview runner *acquires* it. The preview manager will also add the tuple <PreviewRunnerID, ApplicationID> to the PreviewRunnerApplication map (see the sketch after this list).

  • The preview runner pod will write the application and run information to its local file system to handle failure scenarios, as explained in the bootstrap process above.

  • Once the preview run is started, messaging subscribers in the PreviewManager will listen for preview status change notifications as well as preview data, and will write that data to the attached PD under the /data/preview/<appid> directory. For the PreviewManager to store both the data and the status of the preview run, we may need to add additional LevelDB tables.

  • Preview runners can also time out a preview run, similar to how it is done currently.
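A sketch of the acquire step inside the /poll handler, tying together the queue and map from the earlier sketches; the synchronization and return shape are assumptions:

```java
import java.util.Optional;

// Handler body for /poll; runnerId identifies the calling pod. Synchronizing keeps
// the queue and the map consistent, so a request is never counted in both places.
synchronized Optional<PreviewRequestQueue.PreviewRequest> acquire(int runnerId) {
  Optional<PreviewRequestQueue.PreviewRequest> next = jobQueue.poll();
  next.ifPresent(request -> runnerApplicationMap.put(runnerId, request.applicationId()));
  return next;   // the runner persists the app/run info to its local disk before executing
}
```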

Get data

  • Should be similar to the current structure, where the PreviewManager returns the data from the PD.

    • This API needs to be paginated.

Get logs

  • Similar to the current architecture

Get preview system metrics

  • Endpoint to return total WAITING and RUNNING request counts.

UI Impact or Changes

  • A preview request will be in the WAITING state after it is submitted and before it is picked up for execution by a preview runner. The UI will need to handle this state.
  • When a preview request is in the WAITING state, the /status API will also return the number of preview requests ahead of it in the queue, to set expectations for the user.

Security Impact 

NONE

Test Scenarios

Test ID | Test Description | Expected Results

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3


Future work