Design

Image Removed

WorkflowSlaStatsHandler is a part of the App-Fabric

On calling the /stats endpoint

We retrieve a list of all completed Workflow Runs from the Workflow Stats Dataset within the provided time range.
Using this list we calculate the statistics(percentiles requested by user) for a workflow and all its nodes and then return those stats to the user.

The second endpoint gives the user the ability to dig deeper into investigating what the cause was for runs to take unusually long. The /stats endpoint will, for example, return a list of all run-ids that were greater than the 99th percentile. Using those run-ids, we can dig deeper into analyzing and seeing the difference of a particular run against the normal runs of a workflow.

This endpoint provides the user the ability to configure at fine-grained level, the count of runs before and after the current run of interest and also sampling time interval. Let's say the user made a request with count=3 and time_interval=1day, we would return 7 run_ids in the result, 3 runs before the interested run and 3 after, each evenly spaced at 1 day interval.
The details will be collected from MRJobInfoFetcher, MetricsStore and MDS.
FUTURE IMPROVEMENT: Naively, this will just return a evenly spaced sampling from the interval but we could optimize it to provide those results from the range which are close to the average so that the user does not end up seeing only abnormal runs.

The third endpoint gives us the ability to compare 2 runs of a workflow side-by-side. This ability comes in handy when the user has 2 runs, one which is normal, and the other which is an outlier and compare them to see why the outlier run turned out that way.

Workflow Analytics Service

This service should be a system service, with the ability to be turned on or off as the user desires.

...

Handler Endpoint

...

Response

...

Desc

...

app/{app-id}/workflows/{workflow-id}/statistics?start={st}&end={end}&percentile=80&percentile=90&percentile=95&percentile=99

...

{

"startTime" : 1439200000,

"endTime" : 1439300000,

"total_successful_runs" : 100,

"total_unsuccessful_runs" : 10,

"avg_runTime": 30,

"median_runtime": 15,

"percentile_completion" : [

{"80": 20},

{"90": 28},

{"95": 40},

{"99": 45}

],

"runs_above_Percentile": [

{"80": ["32af2-44", "54xxasdss", "..."]},

{"90": ["234asd23", "..."]},

{"95": ["65fgdf7657", "..."]},

{"99": ["84xv6985", "..."]}

],

"nodes": [{

"node1": {

"name": "PurchaseEventParser",

"type": "MapReduce",

"avg_run_time_seconds": 12,

"80_percentile_seconds": 18,

"90_percentile_seconds": 30,

"95_percentile_seconds": 40,

"99_percentile_seconds": 50

}},

{"node2": {

"name": "PurchaseAnalyser",

"type": "Spark",

"avg_run_time_seconds": 17,

"80_percentile_seconds": 18,

"90_percentile_seconds": 30,

"95_percentile_seconds": 40,

"99_percentile_seconds": 50

}

}]

}

...

returns basic stats about the workflow program across all runs. This will help us in detecting which jobs might be responsible for delays.

...

app/{app-id}/workflows/{workflow-id}/runs/{run-id}/stats?start={st}&end={end}&count={count}&time_interval_sec={H}

...

{

"start_time": 1439200000,

"end_time": 1439279804,

"num_runs": 20,

"nodes": {

"node1" : {

"name": "PurchaseEventParser",

"p2": { "runid" : "13fdssvqe423",

"run_times_seconds": 15,

"mappers": 4,

"reducers": 3,

"file_read_data_bytes": 800,

"file_write_data_bytes": 900,

"file_read_data_ops": 80,

"file_write_data_ops": 90,

"file_large_read_data_ops": 8,

"hdfs_read_data_bytes": 700,

"hdfs_write_data_bytes": 800,

"hdfs_read_data_ops": 70,

"hdfs_write_data_ops": 80,

"hdfs_read_data_ops": 5,

"total_time_spnt_map_tasks_seconds": 100,

"total_time_spnt_red_tasks_seconds": 140

},

"p1": {

"runid" : "245fdasfsdaw5ee",

"run_time_seconds": 20,

"mappers": 7,

"reducers": 8,

"file_read_data_bytes": 800,

"file_write_data_bytes": 900,

"file_read_data_ops": 80,

"file_write_data_ops": 90,

"file_large_read_data_ops": 8,

"hdfs_read_data_bytes": 700,

"hdfs_write_data_bytes": 800,

"hdfs_read_data_ops": 70,

"hdfs_write_data_ops": 80,

"hdfs_read_data_ops": 5,

"total_time_spnt_by_map_tasks_seconds": 11,

"total_time_spnt_by_red_tasks_seconds": 9

},

"current":{},

"n1": {},

"n2": {}

},

"node2" : {"name": "PurchaseSpark",

"p2": {},

"p1": {},

"curr": {},

"n1": {},

"n2": {}

}

get stats for each stage for a run of a workflow and compares it to + or - count runs that came in the time window provided. The time interval specified allows the previous or next runs to space out in give range.

Reference -

p{x} is x runs before the current run

...

Info

title	Status: INCOMPLETE

This page is work in progress

Table of Contents

Motivation

If a program takes a longer time than anticipated (rogue workflows), we can pinpoint where the problem is, instead of going over a lot of unnecessary logs. As a System Admin, this would be useful to find which jobs were causing delays in times. As a developer, it will be useful to see what parts of the workflow took how much time, and trying to parallelize/optimize those aspects of the workflow. From a scheduling standpoint, a user can figure out the duration of the average/ worst case runs and set the frequencies of the jobs accordingly. Setting a common standard to show metrics of MR jobs/ Spark at the workflow level will help users understand and analyze their runs better.

User Stories

...

Number

...

User Story

...

Priority

...

1

...

User wants to find what runs of a workflow have experienced delays in meeting SLA

...

H

...

2

...

User wants to see what are the stats for a workflow

...

H

...

3

...

User wants to know what were the reason behind the delay of certain action/run

...

H

...

4

...

User wants to make future resource allocation decisions based on historical performance of past runs

...

M

...

5

...

User wants to see a common metric across various action types to make things look uniform

...

L

...

6

...

User wants to see common aggregations across all runs of a workflow (see below)

...

H

...

7

...

User wants to see statistics across actions in a workflow

...

M

Info

title	Status: INCOMPLETE

This page is work in progress

Table of Contents

Motivation

If a program takes a longer time than anticipated (rogue workflows), we can pinpoint where the problem is, instead of going over a lot of unnecessary logs. As a System Admin, this would be useful to find which jobs were causing delays in times. As a developer, it will be useful to see what parts of the workflow took how much time, and trying to parallelize/optimize those aspects of the workflow. From a scheduling standpoint, a user can figure out the duration of the average/ worst case runs and set the frequencies of the jobs accordingly. Setting a common standard to show metrics of MR jobs/ Spark at the workflow level will help users understand and analyze their runs better.

User Stories

Number	User Story	Priority
1	User wants to find what runs of a workflow have experienced delays in meeting SLA	H
2	User wants to see what are the stats for a workflow	H
3	User wants to know what were the reason behind the delay of certain action/run	H
4	User wants to make future resource allocation decisions based on historical performance of past runs	M
5	User wants to see a common metric across various action types to make things look uniform	L
6	User wants to see common aggregations across all runs of a workflow (see below)	H
7	User wants to see statistics across actions in a workflow	M

Design

Image Added

WorkflowSlaStatsHandler is a part of the App-Fabric

On calling the /stats endpoint

We retrieve a list of all completed Workflow Runs from the Workflow Stats Dataset within the provided time range.
Using this list we calculate the statistics(percentiles requested by user) for a workflow and all its nodes and then return those stats to the user.

The second endpoint gives the user the ability to dig deeper into investigating what the cause was for runs to take unusually long. The /stats endpoint will, for example, return a list of all run-ids that were greater than the 99th percentile. Using those run-ids, we can dig deeper into analyzing and seeing the difference of a particular run against the normal runs of a workflow.

This endpoint provides the user the ability to configure at fine-grained level, the count of runs before and after the current run of interest and also sampling time interval. Let's say the user made a request with count=3 and time_interval=1day, we would return 7 run_ids in the result, 3 runs before the interested run and 3 after, each evenly spaced at 1 day interval.
The details will be collected from MRJobInfoFetcher, MetricsStore and MDS.
FUTURE IMPROVEMENT: Naively, this will just return a evenly spaced sampling from the interval but we could optimize it to provide those results from the range which are close to the average so that the user does not end up seeing only abnormal runs.

The third endpoint gives us the ability to compare 2 runs of a workflow side-by-side. This ability comes in handy when the user has 2 runs, one which is normal, and the other which is an outlier and compare them to see why the outlier run turned out that way.

Workflow Analytics Service

This service should be a system service, with the ability to be turned on or off as the user desires.

Handler Endpoint	Response	Desc
app/{app-id}/workflows/{workflow-id}/statistics?start={st}&end={end}&percentile={p1}&percentile={p2}&percentile={p3}	{ "startTime": 0, "endTime": 1442008469, "runs": 4, "avgRunTime": 7.5, "percentileInformationList": [ { "percentile": 80.0, "percentileTimeInSeconds": 9, "runIdsOverPercentile": [ "e0cc5b98-58cc-11e5-84a1-8cae4cfd0e64" ] }, { "percentile": 90.0, "percentileTimeInSeconds": 9, "runIdsOverPercentile": [ "e0cc5b98-58cc-11e5-84a1-8cae4cfd0e64" ] }, { "percentile": 95.0, "percentileTimeInSeconds": 9, "runIdsOverPercentile": [ "e0cc5b98-58cc-11e5-84a1-8cae4cfd0e64" ] }, { "percentile": 99.0, "percentileTimeInSeconds": 9, "runIdsOverPercentile": [ "e0cc5b98-58cc-11e5-84a1-8cae4cfd0e64" ] } ], "nodes": { "PurchaseHistoryBuilder": { "avgRunTime": "7.0", "99.0": "8", "80.0": "8", "95.0": "8", "runs": "4", "90.0": "8", "type": "MapReduce" } } }	returns basic stats about the workflow program across all runs. This will help us in detecting which jobs might be responsible for delays.
app/{app-id}/workflows/{wfworkflow-id}/runs/{run-id}/compare?other_run={run-id-2} [{"node1":[ {"run1": { "name": "PurchaseEventParser", "type": "MapReduce", "runid" : "13fdssvqe423", "run_time_seconds": 15, "mappers": 4, "reducers": 3, "file_read_data_bytes": 800, "file_write_data_bytes": 900, "file_read_data_ops": 80, "file_write_data_ops": 90, "file_large_read_data_ops": 8, "hdfs_read_data_bytes": 700, "hdfs_write_data_bytes": 800, "hdfs_read_data_ops": 70, "hdfs_write_data_ops": 80, "hdfs_read_data_ops": 5, "total_time_spnt_by_map_tasks_seconds": 100, "total_time_spnt_by_red_tasks_seconds": 140 }}, {"run2": { "name": "PurchaseEventParser", "type": "MapReduce", "run_time_seconds": 50, "mappers": 10, "reducers": 10, "file_read_data_bytes": 800, "file_write_data_bytes": 900, "file_read_data_ops": 80, "file_write_data_ops": 90, "file_large_read_data_ops": 8, "hdfs_read_data_bytes": 700, "hdfs_write_data_bytes": 800, "hdfs_read_data_ops": 70, "hdfs_write_data_ops": 80, "hdfs_read_data_ops": 5, "total_time_spnt_by_map_tasks_seconds": 20, "total_time_spnt_by_red_tasks_seconds": 30 }} ]}, {"node2":[ {"run1": { "name": "PurchaseSpark", "type": "Spark", "executors" : 2, "driver_running_stages": 4, "driver_mem_used_megabytes": 17, "driver_disk_space_used_megabytes": 12, "driver_all_jobs": 7, "driver_failed_stages": 2, "filesystem.file_write_ops": 1500, "filesystem.file_read_ops": 1000, "filesystem.file_large_read_ops": 100, "filesystem.file_write_bytes": 15000, "filesystem.file_read_bytes": 10000, "hdfs.file_write_ops": 1500, "hdfs.file_read_ops": 1000, "hdfs.file_large_read_ops": 100, "hdfs.file_write_bytes": 15000, "hdfs.file_read_bytes": 10000, "current_pool_size": 5, "max_pool_size": 8 }}, {"run2" : {}}] } stats?start={st}&end={end}&liimt={limit}&interval={interval}	{ "startTimes": { "1dd36962-58d9-11e5-82ac-8cae4cfd0e64": 1442012523, "2523aa44-58d9-11e5-90fd-8cae4cfd0e64": 1442012535, "1873ade0-58d9-11e5-b79d-8cae4cfd0e64": 1442012514 }, "programNodesList": [ { "programName": "PurchaseHistoryBuilder", "workflowProgramDetailsList": [ { "workflowRunId": "1dd36962-58d9-11e5-82ac-8cae4cfd0e64", "programRunId": "1e1c3233-58d9-11e5-a7ff-8cae4cfd0e64", "programRunStart": 1442012524, "metrics": { "MAP_INPUT_RECORDS": 19, "REDUCE_OUTPUT_RECORDS": 3, "timeTaken": 9, "MAP_OUTPUT_BYTES": 964, "MAP_OUTPUT_RECORDS": 19, "REDUCE_INPUT_RECORDS": 19 } }, { "workflowRunId": "1873ade0-58d9-11e5-b79d-8cae4cfd0e64", "programRunId": "188a9141-58d9-11e5-88d1-8cae4cfd0e64", "programRunStart": 1442012514, "metrics": { "MAP_INPUT_RECORDS": 19, "REDUCE_OUTPUT_RECORDS": 3, "timeTaken": 7, "MAP_OUTPUT_BYTES": 964, "MAP_OUTPUT_RECORDS": 19, "REDUCE_INPUT_RECORDS": 19 } } ], "programType": "Mapreduce" } ] }	get stats for each stage for a run of a workflow and compares it to + or - count runs that came in the time window provided. The time interval specified allows the previous or next runs to space out in give range. Reference - p{x} is x runs before the current run n{x} is x runs after the current run
app/{app-id}/workflows/{wf-id}/runs/{run-id}/compare?other-run-id={run-id-2}	[ { "programName": "PurchaseHistoryBuilder", "workflowProgramDetailsList": [ { "workflowRunId": "14b8710a-58cd-11e5-98ca-8cae4cfd0e64", "programRunId": "14c9d62b-58cd-11e5-9105-8cae4cfd0e64", "programRunStart": 1442007354, "metrics": { "MAP_INPUT_RECORDS": 19, "REDUCE_OUTPUT_RECORDS": 3, "timeTaken": 7, "MAP_OUTPUT_BYTES": 964, "MAP_OUTPUT_RECORDS": 19, "REDUCE_INPUT_RECORDS": 19 } }, { "workflowRunId": "e0cc5b98-58cc-11e5-84a1-8cae4cfd0e64", "programRunId": "e1497ad9-58cc-11e5-9dfa-8cae4cfd0e64", "programRunStart": 1442007268, "metrics": { "MAP_INPUT_RECORDS": 19, "REDUCE_OUTPUT_RECORDS": 3, "timeTaken": 8, "MAP_OUTPUT_BYTES": 964, "MAP_OUTPUT_RECORDS": 19, "REDUCE_INPUT_RECORDS": 19 } } ], "programType": "Mapreduce" } ]	Compare 2 runs of a workflow side by side

...

Versions Compared

Old Version 15

New Version 16

Key

Design

Workflow Analytics Service

Motivation

User Stories

Motivation

User Stories

Design

Workflow Analytics Service

Page Comparison

Versions Compared

Old Version 15

New Version 16

Key

Design

Workflow Analytics Service

Motivation

User Stories

Motivation

User Stories

Design

Workflow Analytics Service