Info: This page is a work in progress.
...
If a program takes longer than anticipated (a rogue workflow), we can pinpoint where the problem is instead of combing through a lot of unnecessary logs. As a system admin, this is useful for finding which jobs were causing delays. As a developer, it is useful to see how much time each part of the workflow took, and then to parallelize or optimize those parts. From a scheduling standpoint, a user can figure out the duration of the average and worst-case runs and set the frequencies of the jobs accordingly. Setting a common standard for showing metrics of MapReduce and Spark jobs at the workflow level will help users understand and analyze their runs better.
User Stories
Number | User Story | Priority |
---|---|---|
1 | User wants to find which runs of a workflow have experienced delays in meeting their SLA | H |
2 | User wants to see the stats for a workflow | H |
3 | User wants to know the reasons behind the delay of a certain action/run | H |
4 | User wants to make future resource-allocation decisions based on the historical performance of past runs | M |
5 | User wants to see a common metric across various action types so that things look uniform | L |
6 | User wants to see common aggregations across all runs of a workflow (see below) | H |
7 | User wants to see statistics across actions in a workflow | M |
Design
WorkflowSlaStatsHandler is a part of App-Fabric.
...
The second endpoint gives the user the ability to dig deeper into what caused certain runs to take unusually long. The /stats endpoint will, for example, return a list of all run-ids that exceeded the 99th percentile. Using those run-ids, we can analyze how a particular run differs from the normal runs of a workflow.
This endpoint lets the user configure, at a fine-grained level, the count of runs before and after the run of interest, as well as the sampling time interval. For example, if the user makes a request with count=3 and time_interval=1day, we would return 7 run-ids in the result: the run of interest, the 3 runs before it, and the 3 runs after it, each evenly spaced at a 1-day interval (see the example calls below).
The details will be collected from MRJobInfoFetcher, MetricsStore and MDS.
FUTURE IMPROVEMENT: Naively, this will just return an evenly spaced sampling from the interval, but we could optimize it to return runs from the range that are close to the average, so that the user does not end up seeing only abnormal runs.
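A rough sketch of how these two calls might fit together (the host, paths, parameter names, and response shape below are assumptions for illustration, not the final API):

```
# Hypothetical: fetch aggregate statistics for a workflow over a time range.
# The response is assumed to list, among other aggregations, the run-ids of
# runs that fell above the 99th percentile.
curl -X GET "http://localhost:10000/v3/namespaces/default/apps/Purchase/workflows/PurchaseHistoryWorkflow/stats?start=now-7d&end=now&percentile=99"

# Hypothetical: take one slow run-id from the response above and request the
# 3 runs before it and the 3 runs after it, sampled at a 1-day interval
# (7 run-ids in total, including the run of interest).
curl -X GET "http://localhost:10000/v3/namespaces/default/apps/Purchase/workflows/PurchaseHistoryWorkflow/runs/<run-id>/stats?count=3&time_interval=1day"
```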
...
col: node_stats
example: [{program_name, runId, timeTaken_seconds}, {program_name, runId, timeTaken_seconds}]
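A populated value for this column might look like the following (program names, run-ids, and timings are made up for illustration):

```
[
  { "program_name": "PurchaseHistoryBuilder", "runId": "b78d0091-...", "timeTaken_seconds": 845 },
  { "program_name": "PurchaseReportSpark",    "runId": "1c2f4e7a-...", "timeTaken_seconds": 310 }
]
```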
CDAP Metrics System
- All the metrics from Spark are emitted to the CDAP Metrics System.
- User metrics and job-level metrics for MapReduce jobs are emitted to the CDAP Metrics System.
- Metrics for custom action and condition nodes aren't available right now, but they will be emitted to the CDAP Metrics System once implemented.
- The CDAP Metrics System aggregates metrics at multiple resolutions (seconds, minutes, hours). When querying metrics for long-running batch programs, it is useful and efficient to use the minute or hour resolution to get higher-level detail and reduce the number of data points (see the example query below).
- Since we depend heavily on the CDAP Metrics System, CDAP should be operational with the metrics.processor and metrics.query services running.
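For instance, a query against the CDAP Metrics System for a long-running MapReduce node could ask for minute-level resolution instead of seconds. The tags, metric name, and port below are placeholders, and the exact query syntax depends on the CDAP version:

```
# Hypothetical metrics query at 1-minute resolution to reduce the number of
# data points returned for a long-running program.
curl -X POST "http://localhost:10000/v3/metrics/query?tag=namespace:default&tag=app:Purchase&tag=mapreduce:PurchaseHistoryBuilder&tag=run:<run-id>&metric=<metric-name>&start=now-12h&end=now&resolution=1m"
```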
...
CDAP Services are running.
Dataset service is running.
User Interface
curl calls
Implementation Plan
...