Checklist
- Design reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
Runtime Monitor collects program states, metadata, lineage, workflow tokens, and other monitoring data from one or more Runtimes.
Terminology
Runtime - A single run of a CDAP application in the cloud. Each Runtime has its own instance of TMS (Transactional Messaging System).
Runtime Monitor - The component that collects monitoring data from Runtimes.
Heartbeat Handler - The handler that exposes REST APIs for Runtime Monitor.
Approaches
Approach #1
To collect all the monitoring data, Runtime Monitor periodically polls Heartbeat Handler for heartbeat messages through a single REST endpoint.
Design:
- On startup of a Runtime, Program Launcher notifies Runtime Monitor to monitor that Runtime.
- Runtime Monitor polls for a batch of messages from every active Runtime it is monitoring.
- Polling can be done in parallel by multiple threads running inside Runtime Monitor. When load increases, Runtime Monitor can run multiple instances, load-balanced by the number of active Runtime instances.
- To distinguish between messages from different Runtimes, we will add runtimeId as a prefix.
- All Runtime Monitor instances share the same persisted offsets.
- When an application run is no longer active (completed/killed/failed),
- Heartbeat Handler fetches heartbeat messages from each topic using the last persisted offset provided by Runtime Monitor.
- Heartbeat Handler gathers all the heartbeat messages and sends them in a batch to Runtime Monitor, along with the processed offset for each topic.
- If Runtime Monitor fails, it restarts from the last persisted offset for each topic and requests the heartbeat messages after that offset.
- If Runtime Monitor fails while it is updating the corresponding stores, it may reprocess some heartbeat messages, depending on the last persisted offset.
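The polling and recovery steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual CDAP implementation: the class names, the `poll` and `poll_once` methods, the `limit` parameter, and the `<runtimeId>:<topic>` key format are all assumptions made for the example. The design only says that runtimeId is added as a prefix, not what it is prefixed to.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatHandler:
    """Hypothetical stand-in for one Runtime's handler: topic name -> ordered messages."""
    topics: dict

    def poll(self, offsets: dict, limit: int) -> dict:
        """Single endpoint: return up to `limit` messages per topic after each offset."""
        batch = {}
        for topic, messages in self.topics.items():
            start = offsets.get(topic, 0)
            chunk = messages[start:start + limit]
            batch[topic] = (chunk, start + len(chunk))
        return batch


@dataclass
class RuntimeMonitor:
    """Hypothetical monitor; offsets are keyed by '<runtimeId>:<topic>' (an assumption)."""
    persisted_offsets: dict = field(default_factory=dict)
    store: list = field(default_factory=list)

    def poll_once(self, runtime_id: str, handler: HeartbeatHandler, limit: int = 2) -> int:
        prefix = f"{runtime_id}:"
        # Collect this Runtime's offsets by stripping the runtimeId prefix.
        offsets = {
            key[len(prefix):]: value
            for key, value in self.persisted_offsets.items()
            if key.startswith(prefix)
        }
        processed = 0
        for topic, (messages, next_offset) in handler.poll(offsets, limit).items():
            # Update the store first, then persist the offset. A crash between
            # the two steps leads to reprocessing on restart, not message loss.
            self.store.extend(prefix + message for message in messages)
            self.persisted_offsets[prefix + topic] = next_offset
            processed += len(messages)
        return processed
```

Because the offset is persisted only after the store is updated, a restarted monitor that reuses the persisted offsets resumes where the previous one stopped and may see some messages twice, which matches the reprocessing behavior described above.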
Questions
- How will Runtime Monitor interface with Program Launcher to get the runtimeId to monitor? Through discovery?
Pros:
- Fewer HTTP requests.
- A single REST endpoint reduces the number of requests handled by the web server running in Heartbeat Handler.
- Adding more monitoring data is easier, since we do not need a REST endpoint for each type of data we collect.
Cons:
- Load must be balanced across all the topics so that recent information reaches Runtime Monitor with very little delay. Open question: should we read the same number of messages from each topic, or use a single topic for all the monitoring data and read messages from there?
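One possible answer to the per-topic balancing question is to fill each batch round-robin across topics, so that a busy topic cannot starve a quiet one. The sketch below is only an illustration of that option; the function name `fill_batch` and its parameters are hypothetical and do not come from the design.

```python
def fill_batch(pending: dict, batch_size: int) -> list:
    """Take messages round-robin across topics until the batch is full.

    `pending` maps a topic name to its list of unread messages, oldest first.
    Returns a list of (topic, message) pairs.
    """
    cursors = {topic: 0 for topic in pending}
    batch = []
    progressed = True
    # Stop when the batch is full or when a full pass finds nothing to read.
    while len(batch) < batch_size and progressed:
        progressed = False
        for topic, messages in pending.items():
            if len(batch) == batch_size:
                break
            index = cursors[topic]
            if index < len(messages):
                batch.append((topic, messages[index]))
                cursors[topic] = index + 1
                progressed = True
    return batch
```

With this scheme every topic that has unread messages contributes to each batch, and capacity left unused by short topics goes to the longer ones.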
Approach #2
To collect all the monitoring data, Runtime Monitor periodically polls Heartbeat Handler for heartbeat messages through multiple REST endpoints.
Design Details:
- Runtime Monitor polls for the next batch of heartbeat messages, sending the last persisted offset for each topic. Depending on the implementation, the topics can be polled serially or concurrently.
- Heartbeat Handler fetches heartbeat messages from each topic using the last persisted offset provided by Runtime Monitor.
- Heartbeat Handler gathers all the heartbeat messages and sends them in a batch to Runtime Monitor, along with the processed offset for each topic.
- If Runtime Monitor fails, it restarts from the last persisted offset for each topic and requests the heartbeat messages after that offset.
- If Runtime Monitor fails while it is updating the corresponding stores, it may reprocess some heartbeat messages, depending on the last persisted offset.
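The serial and concurrent polling options above could look roughly like this. It is a sketch under stated assumptions: `fetch_topic` stands in for one per-topic REST call and reads from an in-memory list, and the names `poll_topics`, `limit`, and `parallel` are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_topic(topic_messages: list, offset: int, limit: int) -> tuple:
    """Stand-in for one per-topic REST call: returns (messages, next_offset)."""
    chunk = topic_messages[offset:offset + limit]
    return chunk, offset + len(chunk)


def poll_topics(topics: dict, offsets: dict, limit: int, parallel: bool = True) -> dict:
    """Poll every topic from its last persisted offset, serially or concurrently."""
    if not parallel:
        return {
            name: fetch_topic(messages, offsets.get(name, 0), limit)
            for name, messages in topics.items()
        }
    # One worker per topic, so a slow topic does not delay the others.
    with ThreadPoolExecutor(max_workers=max(1, len(topics))) as pool:
        futures = {
            name: pool.submit(fetch_topic, messages, offsets.get(name, 0), limit)
            for name, messages in topics.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

Both modes return the same result. The concurrent mode trades more simultaneous requests against the handler's web server for lower end-to-end delay.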
Pros:
- We can poll different topics concurrently for more data, which reduces delay.
Cons:
- More HTTP requests.
- If the implementation is parallel, the web server running in Heartbeat Handler handles more requests at a time.
- Adding more monitoring data means adding more REST endpoints to Heartbeat Handler and sending more HTTP requests to its web server.
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Request Body | Response Code | Response |
---|---|---|---|---|---|
/v3/namespaces/{namespace}/programs/status | GET | Returns the list of status messages for all the programs in a given namespace | batchsize, start_offset | 200: success; 204: no content; 500: internal error | |
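The response codes in the table could map onto handler logic roughly as follows. The table leaves the response body unspecified, so the JSON shape here (`messages`, `next_offset`, `error`) is purely an assumption, as are the function name `handle_program_status` and the `read_messages` callback.

```python
import json


def handle_program_status(read_messages, batch_size: int, start_offset: int) -> tuple:
    """Return (status_code, body) for one poll of the status endpoint.

    `read_messages(start_offset, batch_size)` is a hypothetical callback that
    returns the status messages stored after `start_offset`.
    """
    try:
        batch = read_messages(start_offset, batch_size)
    except Exception as error:
        # Any internal error maps to 500.
        return 500, json.dumps({"error": str(error)})
    if not batch:
        # Nothing new after start_offset maps to 204 with an empty body.
        return 204, ""
    body = {"messages": batch, "next_offset": start_offset + len(batch)}
    return 200, json.dumps(body)
```

Returning the next offset with each batch would let the caller persist it and pass it back as `start_offset` on the following poll.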
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What is the impact on authorization, and how does the design take care of this aspect?
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of downstream component failures [YARN, HBase, etc.]) and how the design takes care of this aspect.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3