CDAP Provisioning
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
In CDAP 5.0, every time a program runs, it will run using a specific profile. The profile specifies where program execution will take place. Execution may take place on the permanent cluster, or it may take place on compute nodes that are provisioned before execution and torn down afterwards. This page documents the design for profiles and provisioning.
Goals
To define what a profile is, the new provisioning lifecycle for program runs, and how these pieces will fit into CDAP. This document will not focus on profile metadata or authorization. Instead, it will focus on functional requirements, like what occurs once a program run begins.
Use Cases
- A cluster administrator wants to set up a profile for each team in the organization. Each profile is tied to a separate Google Cloud account for billing purposes. Each team is given their own namespace. The cluster admin would like to configure the namespace so that any batch pipeline will, by default, use the profile for that team. Every run of a batch pipeline will use that profile to first provision a specific number of compute nodes with specific hardware and image as defined in the profile. After a successful run, the nodes are torn down.
- A cluster administrator wants to set up three separate profiles, corresponding to small, medium, and large programs. They are all tied to the same Google Cloud account, but each profile uses a different number of nodes and different hardware.
- A cluster administrator wants to configure a profile so that compute nodes are torn down immediately after a successful run, but are retained for 8 hours after a failed run in order to allow people to investigate the failure cause. The admin would like to be able to manually tear down the compute nodes before the 8 hours are up. The admin would also like to turn off automatic teardown of the nodes in case they are needed for longer than 8 hours.
User Stories
- As an admin, I want to create a profile that can be used to provision and tear down nodes for program execution.
- As an admin, I want to configure a profile with the AWS account to use, image, hardware, and min and max node count.
- As an admin, I want to configure a profile with the Google account to use, image, hardware, and min and max node count.
- As an admin, I want to configure a profile to run on the local cluster.
- As an admin, I want to be able to distinguish between a provisioning failure and a program execution failure.
- As an admin, I want to configure a profile to delay node teardown for a specific amount of time after a failed program execution.
- As an admin, I want to be able to choose whether nodes should be torn down when stopping a program.
- As an admin, I want to be able to manually tear down nodes for a program run that is in a terminal state.
- As an admin, I want to be able to set a maximum number of concurrent runs for a batch pipeline, where concurrent means the nodes from a previous run are still provisioned.
- As an admin, I want to be able to override provisioner properties that were set on the profile on a per run basis.
- As an admin, I want to be able to get statistics about the nodes currently provisioned by a profile.
- As an admin, I want to be able to get all programs and schedules assigned to a given profile.
- (future work) As an admin, I want to be able to configure runs to use an existing cluster.
Design
Terminology
Profile - CDAP entity that specifies how and where a program will be executed. It encapsulates any information required to set up and tear down the program execution environment, such as the type of cloud provider, account information, hardware information, image, minimum and maximum node count, TTL, etc. A profile is identified by name and must be assigned a provisioner and its related configuration.
Provisioner - Performs the actual runtime actions required for spinning up, bootstrapping, and tearing down nodes for program execution. This will likely be pluggable in the future, but need not be in the first version. There will be a default provisioner that uses the local cluster as the execution environment, an Amazon type, and a Google type. Each provisioner defines its own set of configuration settings. For example, the Amazon provisioner could have AWS region, IAM, security group info, secret key, etc. as configuration, while the Google provisioner could have account, API key, disk type, network name, etc. as configuration.
Program Lifecycle
In addition to the current program run state, we will introduce a new cluster state. State transitions will follow the diagram below, with c: labeling the cluster state and p: labeling the program state.
Architecture
We will introduce a new Provisioning service that will be responsible for provisioning and de-provisioning compute nodes. A provisioning subscriber will listen to TMS and call the service.
A provisioner will be able to store and read state, as the provision and de-provision operations need to be implemented in an idempotent way in case there are failures in the middle of an operation. For example, when provisioning nodes, it will need to store state that it is provisioning X nodes for run Y before actually making a request to create a node. As each node is created, it will need to update the state for that node, in case it crashes after creating 3 nodes while waiting for 2 more to be created. This way, if the operation is retried, it will know that it just needs to wait for the remaining 2 nodes to finish being created instead of starting from scratch and creating 5 more nodes.
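As an illustration of this bookkeeping, here is a minimal sketch in Java; the StateStore and CloudClient interfaces are hypothetical stand-ins rather than actual CDAP or cloud APIs.

// A minimal sketch of the idempotent bookkeeping described above. StateStore and CloudClient
// are hypothetical stand-ins, not actual CDAP or cloud APIs.
public class IdempotentProvisionSketch {

  interface StateStore {
    void put(String key, String value); // persist state for a run
    String get(String key);             // returns null if no state exists for the key
  }

  interface CloudClient {
    boolean nodeExists(String nodeId);
    void requestNode(String nodeId);    // issue an asynchronous create request
  }

  private final StateStore state;
  private final CloudClient cloud;

  IdempotentProvisionSketch(StateStore state, CloudClient cloud) {
    this.state = state;
    this.cloud = cloud;
  }

  // Provision numNodes nodes for the given run. Safe to retry: nodes that were already
  // requested by a previous attempt are not requested again.
  void provision(String runId, int numNodes) {
    // Record the overall intent first, so a crash before any cloud call still leaves a record.
    state.put(runId + ".requested", Integer.toString(numNodes));
    for (int i = 0; i < numNodes; i++) {
      String nodeId = runId + "-node-" + i;
      if (state.get(nodeId) == null && !cloud.nodeExists(nodeId)) {
        state.put(nodeId, "REQUESTING_CREATE"); // record intent before calling the cloud
        cloud.requestNode(nodeId);
      }
      state.put(nodeId, "POLLING_CREATE");      // the create request is now known to be in flight
    }
  }
}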
Overwriting Profile Properties
Each profile will specify a set of provisioner properties, such as the cloud account information, or the amount of time to retry failed API calls. These properties can be overwritten using preferences, schedule properties, and runtime arguments.
If a preference, schedule property, or runtime argument is prefixed with 'system.profile.properties.', CDAP will strip the prefix and set that property, overwriting any value that may have already been set at a higher level.
Profile properties are overwritten by preferences, which are overwritten by schedule properties and runtime arguments. For example, suppose the provisioner uses a 'retryTimeout' property.
- The profile has 'retryTimeout' set to '600'.
- This value can be overwritten by setting a 'system.profile.properties.retryTimeout' preference.
- For a scheduled run, both the profile and preference values can be overwritten by setting 'system.profile.properties.retryTimeout' in the schedule properties.
- For a manually started run, both the profile and preference values can be overwritten by setting 'system.profile.properties.retryTimeout' in the runtime arguments.
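A minimal sketch of this resolution order, assuming the profile properties, preferences, and schedule properties/runtime arguments are already available as maps (the class and method names here are illustrative, not actual CDAP code):

import java.util.HashMap;
import java.util.Map;

public class ProfilePropertyResolution {
  private static final String PREFIX = "system.profile.properties.";

  // Merge provisioner properties in increasing order of precedence: profile properties,
  // then preferences, then schedule properties or runtime arguments. Only keys carrying the
  // prefix override profile values, and the prefix is stripped before the value is applied.
  static Map<String, String> resolve(Map<String, String> profileProperties,
                                     Map<String, String> preferences,
                                     Map<String, String> scheduleOrRuntimeArgs) {
    Map<String, String> resolved = new HashMap<>(profileProperties);
    applyOverrides(resolved, preferences);
    applyOverrides(resolved, scheduleOrRuntimeArgs);
    return resolved;
  }

  private static void applyOverrides(Map<String, String> resolved, Map<String, String> overrides) {
    for (Map.Entry<String, String> entry : overrides.entrySet()) {
      if (entry.getKey().startsWith(PREFIX)) {
        resolved.put(entry.getKey().substring(PREFIX.length()), entry.getValue());
      }
    }
  }
}

With the 'retryTimeout' example above, a profile value of '600', a preference of '300', and a runtime argument of '120' would resolve to '120' for a manually started run.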
Profile Assignment
A profile is assigned to a program run in a hierarchical fashion, similar to how runtime arguments work. At the CDAP level, a default profile is assigned through cdap-site.xml. This default profile can be overridden at the namespace level. Every program can set a profile at configuration time, and every program can also override the profile through preferences. Every schedule can also specify a profile, and runtime arguments can be used to override a profile. The hierarchy is:
system -> namespace -> program configuration -> app preference -> program preference -> schedule property / runtime argument
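A sketch of how this resolution could look, walking the hierarchy from most to least specific; the method is illustrative, and each argument is null if that level does not set a profile:

import java.util.Arrays;
import java.util.List;

public class ProfileAssignmentSketch {
  // Return the profile for a program run: the most specific level that sets a profile wins.
  // The system default is assumed to always be set (e.g. in cdap-site.xml).
  static String resolveProfile(String systemDefault, String namespaceProfile, String programConfig,
                               String appPreference, String programPreference, String scheduleOrRuntimeArg) {
    List<String> mostToLeastSpecific = Arrays.asList(
      scheduleOrRuntimeArg, programPreference, appPreference,
      programConfig, namespaceProfile, systemDefault);
    for (String profile : mostToLeastSpecific) {
      if (profile != null) {
        return profile;
      }
    }
    return systemDefault;
  }
}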
Profile Usage Summary
The idea is to support a profile usage summary page in the UI.
In order to do this, we need to be able to get all programs assigned to use a given profile, whether through manual runs or scheduled runs. We also need counters for total successful, failed, and killed runs per program, as well as total node hours. To fetch this information, we could add a system-level metadata property named 'profile' to every program and schedule. The existing metadata search API could then be used to find any entity assigned to a specific profile. Other statistics can be gathered using the Metrics system, as they are just incrementing counters. Note that this does not let somebody see programs that ran with the profile in the past but are no longer assigned to it; to view historical information, one will likely have to go to the operational stats page. This implementation would also require special consideration for system profiles, as metadata is currently namespaced. Also, in order to filter out non-pipelines, pipelines will need to be tagged as pipelines in their metadata.
Node hours will be implemented by periodically emitting metrics. This can be done by the runtime monitor in the cloud environment, as that system will be periodically checking things anyway.
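For example, node hours could be derived from a counter that the runtime monitor increments on each poll, roughly as in this sketch; the metric name and emitter interface are assumptions made for illustration:

import java.util.concurrent.TimeUnit;

public class NodeHoursSketch {
  interface MetricsEmitter {
    void increment(String metricName, long delta); // hypothetical counter API
  }

  // Called on every monitoring poll: emit node-minutes for the profile. Summing the counter
  // and dividing by 60 yields the node hours reported in the usage summary.
  static void emitNodeTime(MetricsEmitter emitter, String profileName, int provisionedNodes,
                           long pollIntervalMillis) {
    long nodeMinutes = provisionedNodes * TimeUnit.MILLISECONDS.toMinutes(pollIntervalMillis);
    emitter.increment("profile." + profileName + ".node.minutes", nodeMinutes);
  }
}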
Profile Metadata
To fetch metadata about a profile (i.e., the number of pipelines/schedules related to it and its node hours), we first need the association between profiles and programs/schedules. Currently, profile information is all stored in the Preference Store. Since profiles can be specified hierarchically (instance → namespace → app → program → schedule argument/runtime argument), updating a preference at a higher level requires propagating the metadata to all programs at lower levels. For example, updating a profile at the namespace level requires updating the metadata of every program in that namespace, which can take some time if the namespace contains many programs. To achieve this, the propagation will run asynchronously, meaning the correct information will be shown eventually once the process completes. A sketch of the event handling follows the list below.
- We asynchronously update the metadata to index the profile for each program/schedule.
- We will send an event to TMS whenever the index needs to be updated. This is needed after the following scenarios:
  - Adding/updating profile information in the Preference Store
    - Index the affected programs/schedules at or below the level of the entity whose profile was updated. For example, updating a namespace-level preference should update all programs/schedules in that namespace.
  - Deleting profile information in the Preference Store
    - Remove the deleted profile info from the affected entities and point them to a higher-level profile setting if one exists. For example, removing a namespace-level preference should remove the profile info from all programs/schedules in that namespace and make them point to the instance-level setting if one exists.
  - Deploying an app
    - Index the programs/schedules in the app with the correct profile. (This is needed because an async update may be in progress; if we set the metadata directly, it may get overwritten.)
  - Deleting an app
    - Remove the profile metadata of the programs/schedules. (This is needed because, even though all app and program metadata is removed when an app is deleted, an async update may still be in progress and could add it back.)
  - Adding/updating schedules
    - Index the schedule with the profile.
  - Deleting schedules
    - Remove the profile metadata.
  - Namespace deletion is already covered by app deletion, since apps are deleted one by one.
- When an event is processed (for example, for adding/updating profile information in the Preference Store), all affected entities (programs/schedules) are re-indexed with the updated profile metadata.
- Since the process is async, stale information may be shown while it runs, but the results will eventually be correct.
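The sketch below shows how such an event could be handled by the async indexer. The event shape and store interfaces are assumptions made for illustration, not actual CDAP classes.

import java.util.List;

public class ProfileMetadataIndexerSketch {
  // Hypothetical event published to TMS when a preference, app, or schedule changes.
  static class ProfileUpdateEvent {
    final String entityId;    // namespace, app, program, or schedule whose profile setting changed
    final String profileName; // null means the profile setting was removed at this level
    ProfileUpdateEvent(String entityId, String profileName) {
      this.entityId = entityId;
      this.profileName = profileName;
    }
  }

  interface MetadataStore {
    List<String> listProgramsAndSchedules(String entityId);   // entities at or below the given level
    void setProfileMetadata(String entityId, String profile); // write the 'profile' system property
    String resolveProfileFromHigherLevel(String entityId);    // walk up the preference hierarchy, may return null
  }

  // Re-index every program/schedule under the updated entity. On deletion, fall back to whatever
  // profile is set at a higher level, if any.
  static void handle(ProfileUpdateEvent event, MetadataStore store) {
    for (String child : store.listProgramsAndSchedules(event.entityId)) {
      String profile = event.profileName != null
        ? event.profileName
        : store.resolveProfileFromHigherLevel(child);
      store.setProfileMetadata(child, profile); // a null profile means no level sets one
    }
  }
}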
API for Profile Metadata and Metrics
We will use the metadata search API to get all associations with a given profile. Since the search API is currently namespaced, we need to add a system-level metadata search API to get metadata from all namespaces for system profiles.
The metadata REST endpoints will look like the following:
For system level:
GET /v3/metadata/search?query=profile:{string representation of profile id}
For namespace level:
GET /v3/namespaces/{namespace-id}/metadata/search?query=profile:{string representation of profile id}
Profile Deletion
The following restrictions apply when deleting a profile: 1. The profile must be disabled. 2. There must be no programs/schedules associated with the profile (this can be checked using a reverse index). The detailed steps are as follows:
- Add disable/enable profile functionality.
  - If a profile is disabled, no entities (programs/schedules) can be associated with it, so the following operations will fail when they reference a disabled profile:
    - updating the preference store
    - deploying an app (whose schedule arguments reference a disabled profile)
    - updating a schedule
    - using a runtime argument when a program starts
  - The state change must be synchronous.
- To determine whether the profile is associated with any entity, we will keep a reverse index on the preference store (see the sketch after this list).
  - The profile name is stored as system.profile.name → {profile-name}.
  - We want a reverse index of the form system.profile.name.{profile-name} → [entity1, entity2]. For example, if the namespace-level preference for ns1 is updated to system.profile.name → profile1, we store system.profile.name.profile1 → [ns1]. If another preference setting for profile1 then comes in for program1, we store system.profile.name.profile1 → [ns1, program1].
  - The indexing must be synchronous and done right after writing to the preference store.
- To delete the profile, we then just need to check that the list for that profile is empty.
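A minimal sketch of maintaining this reverse index alongside the preference write; the IndexStore interface is a stand-in for the actual preference store table.

import java.util.LinkedHashSet;
import java.util.Set;

public class ProfileReverseIndexSketch {
  interface IndexStore {
    Set<String> get(String key);                // returns an empty set if the key is absent
    void put(String key, Set<String> entities); // written in the same transaction as the preference
  }

  // Called right after 'system.profile.name' is written for an entity: add the entity to the
  // reverse index 'system.profile.name.{profile-name}' -> [entities].
  static void addAssociation(IndexStore store, String profileName, String entityId) {
    String key = "system.profile.name." + profileName;
    Set<String> entities = new LinkedHashSet<>(store.get(key));
    entities.add(entityId);
    store.put(key, entities);
  }

  // Profile deletion check: the profile can only be deleted when no entity is associated with it.
  static boolean canDelete(IndexStore store, String profileName) {
    return store.get("system.profile.name." + profileName).isEmpty();
  }
}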
Idempotent Provisioners and Failure Scenarios
There are several types of failures that CDAP and provisioners need to account for:
1. Provisioning Service killed during provision
2. Provisioning Service killed during de-provision
3. Account quota hit during provision
4. Cloud services down during provision
5. Cloud services down during de-provision
6. Node(s) manually deleted before or during de-provision
7. Node(s) manually deleted during provision
8. Provisioner bug causes de-provision to always fail
9. Data corruption (ex: accidental table truncation) causes provisioner state to be erased before or during a de-provision operation
Provisioners must implement their provision and de-provision methods in an idempotent fashion. This will handle failure scenarios 1-7. For example, a provisioner can implement idempotency by storing state before and after any operation related to a resource is performed. Each resource can have the following lifecycle:
When a provision call is retried, the provisioner should first look up the state for each resource and perform a different action based on the resource state.
If it is in 'Requesting Create', check if the resource exists in the cloud. If not, request the resource. If so, transition to 'Polling Create'. If the Cloud API allows you to specify the resource 'id', this can be done by simply getting that resource. If not, there needs to be a way to list all resources and check if any belong to the particular program run.
If it is in 'Polling Create' or 'Polling Delete', the provisioner just needs to keep polling.
If it is in 'Requesting Delete', check the status of the resource. If it is not being deleted, request a delete. If it is being deleted, transition to 'Polling Delete'. If it does not exist, transition to 'Requesting Create'.
The de-provision operation can follow a similar lifecycle:
When a de-provision call is retried, the provisioner should first look up the state for each resource and perform a different action based on the resource state.
If it is in 'Created', check the Cloud resource state. If it is deleting, transition to 'Polling Delete'. If it is deleted, transition to 'Deleted'.
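The retry handling above amounts to a small state machine per resource. The following sketch shows the provision-side decisions; the de-provision side is analogous. The state names follow the lifecycle above, and the cloud client is a hypothetical stand-in.

public class ResourceRetrySketch {
  enum ResourceState { REQUESTING_CREATE, POLLING_CREATE, CREATED, REQUESTING_DELETE, POLLING_DELETE, DELETED }

  interface CloudClient {
    boolean exists(String resourceId);
    boolean isDeleting(String resourceId);
    void requestCreate(String resourceId);
    void requestDelete(String resourceId);
  }

  // Decide the next state for a single resource when a provision call is retried.
  static ResourceState retryProvisionStep(CloudClient cloud, String resourceId, ResourceState current) {
    switch (current) {
      case REQUESTING_CREATE:
        if (cloud.exists(resourceId)) {
          return ResourceState.POLLING_CREATE;    // the request already went through on a previous attempt
        }
        cloud.requestCreate(resourceId);
        return ResourceState.REQUESTING_CREATE;   // confirmed on the next check before moving on
      case REQUESTING_DELETE:
        if (!cloud.exists(resourceId)) {
          return ResourceState.REQUESTING_CREATE; // already gone, so the resource must be recreated
        }
        if (cloud.isDeleting(resourceId)) {
          return ResourceState.POLLING_DELETE;
        }
        cloud.requestDelete(resourceId);
        return ResourceState.REQUESTING_DELETE;
      default:
        return current; // POLLING_* states simply keep polling; CREATED and DELETED are terminal here
    }
  }
}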
Failure Scenario 8 can be fixed with a code fix to the provisioner. Since that may be time consuming, it would be useful to have some manual failsafe in case a program run is stuck in the de-provisioning state due to a bug. This can be done as a manual REST call on a program run that forces a transition from the deprovisioning state into the next state.
Failure Scenario 9 is hard to deal with without leaking resources. For example, suppose provisioner state is erased, leaving several nodes in the cloud that are no longer tied to the cluster for a run. When the de-provision method is run, the provisioner will not find any resources it needs to delete and will leave the nodes in the cloud. This could conceivably be dealt with by a janitor that periodically runs, lists all resources in use via the Cloud API, then checks the cluster state of the corresponding program run. We may be able to supply a manual tool that people can run for this purpose, but it seems like overkill to address this type of failure within the normal parameters of the provisioner.
API changes
New Programmatic APIs
Provisioner API
// Any method that throws a RetryableException will be retried by CDAP
// All methods must be implemented in an idempotent way
public interface Provisioner {

  ProvisionerSpecification getSpec();

  /**
   * Validate provisioner properties.
   */
  void validate(Map<String, String> config);

  /**
   * Perform a request to create the cluster.
   */
  Cluster requestCreate(ProvisionContext context) throws Exception;

  /**
   * Determine what status the create request is in.
   */
  ClusterStatus getCreateStatus(ProvisionContext context, Cluster cluster) throws Exception;

  /**
   * Perform a request to delete the cluster.
   */
  void requestDelete(ProvisionContext context, Cluster cluster) throws Exception;

  /**
   * Determine what status the delete request is in.
   */
  ClusterStatus getDeleteStatus(ProvisionContext context, Cluster cluster) throws Exception;
}

public interface ProvisionContext {
  /**
   * Get the program run id.
   */
  ProgramRunId getProgramRunId();

  /**
   * Get merged provisioner properties. Properties are taken from the profile provisioner properties, overwritten by
   * any preferences prefixed by 'system.provisioner.', then overwritten by any schedule properties or runtime args
   * prefixed by 'system.provisioner.'. The 'system.provisioner.' prefix will be stripped before placing them in the
   * property map.
   */
  Map<String, String> getProperties();

  /**
   * Save state for the specified key.
   */
  void saveState(String key, String val);

  /**
   * Transactionally save all state in the specified map.
   */
  void saveState(Map<String, String> state);

  /**
   * Read the state for the specified key. If none exists, returns null.
   */
  @Nullable
  String getState(String key);

  /**
   * Read the state for the specified keys. If no state for a key exists, no corresponding key will
   * be set in the map.
   */
  Map<String, String> getState(Collection<String> keys);

  /**
   * Delete state for the specified key.
   */
  void deleteState(String key);

  /**
   * Delete state for the specified keys.
   */
  void deleteState(Collection<String> keys);
}

public enum ClusterStatus {
  IN_PROGRESS,
  COMPLETE,
  ERROR
}

public class Cluster {
  private final Collection<Node> nodes;
  private final Map<String, String> properties;
}

public class Node {
  private final String id;
  private long createtime;
  private final Map<String, String> properties;
}

public class ProvisionerSpecification {
  private final String name;
  private final String label;
  private final String description;
}
Note that both the provision (requestCreate) and de-provision (requestDelete) methods must be implemented in an idempotent way; otherwise there may be resource leaks or failures on retry.
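For illustration, a provisioner built against the API above might look like the following sketch. ExampleCloudClient is a hypothetical cloud API, and Cluster, Node, and ProvisionerSpecification are assumed to have constructors for the fields listed above; this is a sketch, not a complete or final implementation.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ExampleCloudProvisioner implements Provisioner {

  interface ExampleCloudClient {
    boolean clusterExists(String name);
    boolean clusterRunning(String name);
    boolean clusterDeleted(String name);
    List<Node> createCluster(String name, Map<String, String> properties);
    void deleteCluster(String name);
  }

  private final ExampleCloudClient cloud;

  public ExampleCloudProvisioner(ExampleCloudClient cloud) {
    this.cloud = cloud;
  }

  @Override
  public ProvisionerSpecification getSpec() {
    return new ProvisionerSpecification("example-cloud", "Example Cloud", "Sketch of a cloud provisioner");
  }

  @Override
  public void validate(Map<String, String> config) {
    if (!config.containsKey("accountKey")) {
      throw new IllegalArgumentException("The 'accountKey' provisioner property must be set.");
    }
  }

  @Override
  public Cluster requestCreate(ProvisionContext context) throws Exception {
    // Derive a unique cluster name from the run and record it before calling the cloud,
    // so a retried call reuses the same name instead of creating a second cluster.
    String clusterName = context.getState("clusterName");
    if (clusterName == null) {
      clusterName = "cdap-" + context.getProgramRunId();
      context.saveState("clusterName", clusterName);
    }
    List<Node> nodes = cloud.clusterExists(clusterName)
      ? new ArrayList<Node>()                                      // request already made by a previous attempt
      : cloud.createCluster(clusterName, context.getProperties());
    return new Cluster(nodes, Collections.singletonMap("name", clusterName));
  }

  @Override
  public ClusterStatus getCreateStatus(ProvisionContext context, Cluster cluster) throws Exception {
    return cloud.clusterRunning(context.getState("clusterName"))
      ? ClusterStatus.COMPLETE : ClusterStatus.IN_PROGRESS;
  }

  @Override
  public void requestDelete(ProvisionContext context, Cluster cluster) throws Exception {
    String clusterName = context.getState("clusterName");
    if (cloud.clusterExists(clusterName)) {
      cloud.deleteCluster(clusterName); // safe to call again if a previous attempt failed midway
    }
  }

  @Override
  public ClusterStatus getDeleteStatus(ProvisionContext context, Cluster cluster) throws Exception {
    return cloud.clusterDeleted(context.getState("clusterName"))
      ? ClusterStatus.COMPLETE : ClusterStatus.IN_PROGRESS;
  }
}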
Runtime Provider Extensions
We will use the same extension framework that we have in place for security and operational metrics to make the runtime providers pluggable. Any jar placed in the ext/runtimeproviders directory will show up in the list of runtime providers. In addition, each provisioner can specify a json file in its directory that will be returned by the REST API and will control how the provisioner properties are rendered. The json file is exactly the same format as the widget json for pipeline plugins. The json file is expected to be named after the provisioner. For example, if the provisioner is named 'GCP-DataProc', the json file should be named 'GCP-DataProc.json'.
Deprecated Programmatic APIs
N/A
New REST APIs
Path | Method | Description | Request Params | Request Body | Response Code | Response |
---|---|---|---|---|---|---|
/v3/namespaces/{namespace}/profiles | GET | Returns the list of profiles in a namespace. | includeSystem: whether to include system profiles. Defaults to false. | none | 200 - On success | [ { "name": "MyProfile", "description": "...", "scope": "USER" | "SYSTEM", "status": "ENABLED" | "DISABLED", "keepalive": { "killed": time in seconds, "failed": time in seconds }, "timeout": time in seconds, "provisioner": { "name": "GoogleDataProc", "properties": { "projectId": "...", ... } } }, ... ] |
/v3/namespaces/{namespace}/profiles/{profile} | GET | Returns profile information | | none | 200 - On success 404 - Profile doesn't exist | { "name": "MyProfile", "description": "...", "scope": "USER" | "SYSTEM", "status": "ENABLED" | "DISABLED", "keepalive": { "killed": time in seconds, "failed": time in seconds }, "timeout": time in seconds, "provisioner": { "name": "GoogleDataProc", "properties": [ { "name": "projectId", "value": "...", "editable": true | false // default is true } ] } } |
/v3/namespaces/{namespace}/profiles/{profile} | PUT | Write profile | | { "description": "...", "keepalive": { "killed": time in seconds, "failed": time in seconds }, "timeout": time in seconds, "provisioner": { "name": "GoogleDataProc", "properties": [ { "name": "projectId", "value": "...", "editable": true }, ... ] } } | 200 - On success 400 - Bad profile | |
/v3/namespaces/{namespace}/profiles/{profile} | DELETE | Delete profile. A profile must be in the disabled state before it can be deleted. Before a profile can be deleted, it cannot be assigned to any program or schedule, and it cannot be in use by any running program. | | none | 200 - On success 404 - Profile doesn't exist 409 - Program using profile still exists | |
/v3/namespaces/{namespace}/profiles/{profile}/disable | POST | Disable the profile, so that no new program runs can use it, and no new schedules/programs can be assigned to it. | | none | | |
/v3/namespaces/{namespace}/profiles/{profile}/enable | POST | Enable the profile | | none | | |
/v3/provisioners | GET | List provisioners | | none | 200 - On success | [ { "name": "Google DataProc", "description": "Provisioner using Google DataProc", "configuration-groups": [ same as plugin widget json ] } ] |
/v3/provisioners/{provisioner} | GET | Get provisioner details | | none | 200 - On success 404 - Provisioner doesn't exist | { "name": "Google DataProc", "description": "Provisioner using Google DataProc", "configuration-groups": [ same as plugin widget json ] } |
/v3/namespaces/{namespace}/apps/{app}/{programtype}/{program}/runs/{run} | GET | Get program run information. Enhanced to include cluster state and expiry time (if applicable) | | | | { ..., "cluster": { "status": "provisioning" | "provisioned" | "deprovisioning" | "deprovisioned", "expiresAt": timestamp, "nodes": [ { "id": node-id, "createtime": timestamp, "properties": { ... } }, ... ] } } |
/v3/namespaces/{namespace}/apps/{app}/{programtype}/{program}/runs/{run}/stop | POST | Stop a program, optionally waiting to deprovision. | | { // defaults to false "keepalive": true } | 200 - On Success 409 - Run is not in a stoppable state | |
/v3/namespaces/{namespace}/apps/{app}/{programtype}/{program}/runs/{run}/deprovision | POST | Deprovision nodes for a run | | | 200 - On Success 409 - Run is not in 'waiting' state | |
/v3/namespaces/{namespace}/apps/{app}/{programtype}/{program}/runs/{run}/extend | POST | Extend the TTL for a run | | { // timestamp that the cluster should expire at "expireAt": timestamp } | 200 - On Success 409 - Run is not in 'waiting' state | |
Deprecated REST API
N/A
CLI Impact or Changes
- Add profile commands - list, get, create, edit, delete
UI Impact or Changes
- Profile management
Security Impact
Profiles will require users to provide secret information, such as cloud account credentials. This information should be stored securely, for example in the CDAP secure store, rather than as plaintext profile properties.
Impact on Infrastructure Outages
Adds an additional dependency on external cloud services.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release 5.0.0
Release 5.1.0
Related Work
- Work #1
- Work #2
- Work #3