Checklist

Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

Runtime Monitor monitors Runtimes and collects program states, metadata, lineage, workflow tokens, etc. from them.

Terminology

Runtime - A single run of a CDAP application in the cloud. Each Runtime has its own instance of TMS.

Runtime Monitor - The component that collects monitoring data from Runtimes.

Heartbeat Handler - The handler that exposes REST APIs for Runtime Monitor.

Approaches

Approach #1

To collect all the monitoring data, Runtime Monitor periodically polls Heartbeat Handler for heartbeat messages using a single REST endpoint.

Design:

When a Runtime starts, Program Launcher notifies Runtime Monitor to monitor it.
Runtime Monitor gets the list of topics to monitor from Heartbeat Handler. This is needed when a Runtime is still running while CDAP restarts with a different cdap-site.xml in which topic names have changed. Dynamic configuration is not supported, so the Runtime is unaware of these changes and keeps publishing to the same topics. Runtime Monitor should therefore monitor only the topics that the Runtime actually uses.
Runtime Monitor Service polls for a batch of messages from every active Runtime it is monitoring.
1. Each Runtime Monitor is a single thread that polls an endpoint for metadata across different topics. When load increases, Runtime Monitor Service can run multiple instances, load balanced by the number of active Runtime instances.
2. All Runtime Monitor instances share the same persisted offsets.
3. Runtime Monitor maintains a separate offset for each Runtime.
4. Runtime Monitor retries polling periodically if Heartbeat Handler is not available.
5. Publishing notifications about program state to CDAP TMS should be transactional.
Heartbeat Handler fetches messages from the list of topics, starting from the last persisted offset provided by Runtime Monitor.
Heartbeat Handler buffers the messages and sends them in a batch to Runtime Monitor, along with the last processed offset for each topic.
1. If Runtime Monitor fails, it restarts from the last persisted offset for each topic and asks for the heartbeat messages after that offset.
2. If Runtime Monitor fails while it is updating the corresponding stores, it may reprocess some heartbeat messages, depending on the last persisted offset.
After a Runtime becomes inactive, all offsets persisted for that Runtime should be cleaned up.
1. This assumes Runtime Monitor is notified when a Runtime becomes inactive. Based on this notification, Runtime Monitor marks that Runtime's entry in the offset table as deleted.
2. A separate janitor process, which runs only on the first instance of Runtime Monitor, deletes the entries marked as deleted in batches.
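
The poll cycle described in the design steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `RuntimeMonitor`, `make_fake_handler`, and the in-memory `offsets` and `store` dictionaries are hypothetical stand-ins for the monitor, the HTTP call to Heartbeat Handler, the persisted offset table, and the downstream stores. Writes are keyed by topic and message id, so reprocessing a message after a failure leaves the store unchanged.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a message is (message_id, payload). The fetch function
# stands in for the HTTP call to Heartbeat Handler.
Message = Tuple[str, bytes]
Request = Dict[str, Dict[str, object]]
Fetch = Callable[[Request], Dict[str, List[Message]]]


@dataclass
class RuntimeMonitor:
    topics: List[str]
    fetch: Fetch
    limit: int = 100
    # Stand-in for the persisted offset table: topic -> last processed message id.
    offsets: Dict[str, str] = field(default_factory=dict)
    # Stand-in for the downstream stores, keyed by (topic, message id) so that
    # processing the same message twice is harmless.
    store: Dict[Tuple[str, str], bytes] = field(default_factory=dict)

    def poll_once(self) -> int:
        """Run one poll cycle and return the number of messages processed."""
        request = {
            topic: {"messageId": self.offsets.get(topic, ""), "limit": self.limit}
            for topic in self.topics
        }
        processed = 0
        for topic, messages in self.fetch(request).items():
            for message_id, payload in messages:
                self.store[(topic, message_id)] = payload  # idempotent write
                self.offsets[topic] = message_id  # advance offset after the write
                processed += 1
        return processed


def make_fake_handler(log: Dict[str, List[Message]]) -> Fetch:
    """Build an in-memory replacement for Heartbeat Handler (illustration only)."""
    def fetch(request: Request) -> Dict[str, List[Message]]:
        result = {}
        for topic, params in request.items():
            messages = log.get(topic, [])
            ids = [message_id for message_id, _ in messages]
            last_id = params["messageId"]
            # Return only messages after the offset supplied by the monitor.
            start = ids.index(last_id) + 1 if last_id in ids else 0
            result[topic] = messages[start:start + params["limit"]]
        return result
    return fetch
```

If the monitor fails between the store write and the offset update, the next cycle fetches that message again and overwrites the same store entry, which matches the reprocessing behavior described in the failure cases.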

Questions

What is the interface with Program Launcher for getting the RuntimeId to monitor?
How will Runtime Monitor be notified that a Runtime has become inactive?

Pros:

Fewer HTTP requests.
A single REST endpoint reduces the number of requests handled by the web server running in Heartbeat Handler.
Adding more monitoring data is easier, since a separate REST endpoint is not needed for each type of data collected.

Cons:

Reads must be balanced across topics so that recent information reaches Runtime Monitor with very little delay. Open question: should the same number of messages be read from each topic, or should there be a single topic for all the monitoring data, with messages read from there?
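
One of the options raised above, reading the same number of messages from each topic, could look like the sketch below. The function name `split_limit` and the idea of a total per-request message budget are assumptions made for illustration; the design does not specify either.

```python
from typing import Dict, List


def split_limit(total_limit: int, topics: List[str]) -> Dict[str, int]:
    """Divide a total message budget evenly across topics.

    Any remainder is handed out one message at a time to the first topics,
    so the per-topic limits differ by at most one.
    """
    if not topics:
        return {}
    base, extra = divmod(total_limit, len(topics))
    return {
        topic: base + (1 if index < extra else 0)
        for index, topic in enumerate(topics)
    }
```

An even split is simple, but a busy topic can fall behind while quiet topics leave their share unused. That trade-off is what the single-topic alternative would avoid.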

Approach #2

To collect all the monitoring data, Runtime Monitor periodically polls Heartbeat Handler for heartbeat messages using multiple REST endpoints.

Design Details:

Runtime Monitor polls for the next batch of heartbeat messages, passing the last persisted offset for each topic. Depending on the implementation, polling can be done serially or concurrently for each topic.
Heartbeat Handler fetches heartbeat messages from the topic, starting from the last persisted offset provided by Runtime Monitor.
Heartbeat Handler gathers the heartbeat messages and sends them in a batch to Runtime Monitor, along with the processed offset for each topic.
1. If Runtime Monitor fails, it restarts from the last persisted offset for each topic and asks for the heartbeat messages after that offset.
2. If Runtime Monitor fails while it is updating the corresponding stores, it may reprocess some heartbeat messages, depending on the last persisted offset.
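
The serial and concurrent polling options mentioned above can be sketched as follows. This is an illustration only: `poll_topics` and the `fetch_topic` callback, which stands in for one HTTP request to a per-topic endpoint, are hypothetical names that do not come from the design.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, bytes]
# Hypothetical callback: fetch_topic(topic, last_offset) -> messages after that offset.
FetchTopic = Callable[[str, str], List[Message]]


def poll_topics(
    topics: List[str],
    offsets: Dict[str, str],
    fetch_topic: FetchTopic,
    parallel: bool = True,
) -> Dict[str, List[Message]]:
    """Poll every topic once, either serially or with one thread per topic."""
    def poll(topic: str) -> Tuple[str, List[Message]]:
        return topic, fetch_topic(topic, offsets.get(topic, ""))

    if parallel:
        # One worker per topic; max_workers must be at least 1.
        with ThreadPoolExecutor(max_workers=len(topics) or 1) as pool:
            return dict(pool.map(poll, topics))
    return dict(map(poll, topics))
```

Both modes return the same result. The parallel mode trades more simultaneous requests on the Heartbeat Handler web server for lower end-to-end delay.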

Pros:

More data can be polled concurrently from different topics, which reduces delay.

Cons:

More HTTP requests.
If the implementation is parallel, the web server running in Heartbeat Handler handles more requests at a time.
Adding more monitoring data means adding more REST endpoints to Runtime Handler and sending more HTTP requests to the web server.

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path: /v1/runtime/metadata
Method: POST
Description: Returns runtime status, metadata, and lineage information.
Request Body:

{
  "type" : "map",
  "values" : {
    "type" : "record",
    "name" : "MonitorConsumeRequest",
    "fields" : [
      { "name" : "messageId", "type" : "string" },
      { "name" : "limit", "type" : "int" }
    ]
  }
}

Response Code:
200 - On success
500 - Any internal errors

Response:

{
  "type" : "map",
  "values" : {
    "type" : "array",
    "items" : {
      "type" : "record",
      "name" : "MonitorMessages",
      "fields" : [
        { "name" : "id", "type" : "string" },
        { "name" : "payload", "type" : "bytes" }
      ]
    }
  }
}

Path: /v1/runtime/shutdown
Method: POST
Description: Shuts down the Runtime Handler.
Response Code:
200 - On success

Path: /v1/runtime/program/kill
Method: POST
Description: Kills the running program.
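
To make the request and response schemas for /v1/runtime/metadata concrete, the sketch below builds a request body and decodes a response body. It assumes the bodies are sent as JSON following Avro's JSON encoding, in which a `bytes` value is written as a string whose characters map one-to-one to byte values; the document does not state the wire encoding, so this is an assumption. The helper names `build_request` and `parse_response` are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def build_request(offsets: Dict[str, str], limit: int) -> str:
    """Build a body matching the request schema: topic -> MonitorConsumeRequest."""
    return json.dumps({
        topic: {"messageId": message_id, "limit": limit}
        for topic, message_id in offsets.items()
    })


def parse_response(body: str) -> Dict[str, List[Tuple[str, bytes]]]:
    """Decode a body matching the response schema: topic -> list of MonitorMessages.

    Assumes payload bytes arrive as a string with one character per byte,
    so encoding with latin-1 recovers the original bytes.
    """
    decoded = json.loads(body)
    return {
        topic: [(m["id"], m["payload"].encode("latin-1")) for m in messages]
        for topic, messages in decoded.items()
    }
```

Each key of the request map is a topic name, so one request can carry a different last message id and limit for every topic being monitored.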

Deprecated REST API

Path	Method	Description
/v3/apps/<app-id>	GET	Returns the application spec for a given application

CLI Impact or Changes

Impact #1
Impact #2
Impact #3

UI Impact or Changes

Impact #1
Impact #2
Impact #3

Security Impact

What is the impact on authorization, and how does the design take care of this aspect?

Impact on Infrastructure Outages

System behavior (if applicable, document the impact of downstream component failures [YARN, HBase, etc.]) and how the design takes care of these aspects.

Test Scenarios

Test ID	Test Description	Expected Results
