Monitoring Instance and Pipelines
This article is posted on the CDAP Doc wiki and will be maintained here: Monitoring Instance and Pipelines
This document describes the key services and metrics to monitor for a Data Fusion instance and its pipelines.
Data Fusion System Health
Data Fusion has fine-grained services for each functional aspect of the system. These system services run on GKE pods and are self-healing. Any outage that lasts a long time (more than 5 minutes) will impact the functioning of the instance.
Use the following REST API to get the status of all services:
GET /v3/system/services/status
The response body contains a JSON object that maps each existing system service to its status:
{
"messaging.service": "OK",
"metrics.processor": "OK",
"appfabric": "OK",
"runtime": "OK",
"dataset.executor": "OK",
"metadata.service": "OK",
"metrics": "OK",
"log.saver": "OK"
}
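The status check above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the base URL, the optional bearer token, and the function names are assumptions for illustration, and any status other than "OK" is simply treated as unhealthy.

```python
import json
import urllib.request


def unhealthy_services(status):
    """Return the sorted names of services whose status is not "OK"."""
    return sorted(name for name, state in status.items() if state != "OK")


def fetch_service_status(base_url, token=None):
    """Call GET /v3/system/services/status and return the parsed JSON object.

    base_url and token are hypothetical inputs: how the instance endpoint is
    reached and authenticated depends on the deployment.
    """
    request = urllib.request.Request(f"{base_url}/v3/system/services/status")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A monitoring job could call fetch_service_status periodically and raise an alert when unhealthy_services returns a non-empty list for longer than the 5 minute window mentioned above.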
Data Fusion emits a number of metrics that can be used to monitor critical aspects of pipeline execution. The metrics can be fetched at the following levels:
Namespace level
Pipeline (or App) level
Pipeline run level
See the metrics endpoint reference for details. It is recommended to monitor metrics at the pipeline level. For example, to query the number of warnings logged by a pipeline called Customer_DB_To_BQ over the last day, query the metric system.app.log.warn with the following endpoint:
POST /v3/metrics/query?target=metric&tag=namespace:default&tag=app:Customer_DB_To_BQ&metric=system.app.log.warn&start=now-1d&end=now
Note: The tag app refers to the pipeline. Each pipeline corresponds to a CDAP application with the same name.
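The query string above can be assembled programmatically. The following Python sketch only builds the request path from the parameters shown in the example; the helper name build_metrics_query and its default time range are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_metrics_query(namespace, pipeline, metric, start="now-1d", end="now"):
    """Build the path and query string for a pipeline-level metrics query."""
    params = [
        ("target", "metric"),
        ("tag", f"namespace:{namespace}"),
        ("tag", f"app:{pipeline}"),  # the app tag identifies the pipeline
        ("metric", metric),
        ("start", start),
        ("end", end),
    ]
    # Keep ":" unescaped so the tags read the same as in the example above.
    return "/v3/metrics/query?" + urlencode(params, safe=":")


print(build_metrics_query("default", "Customer_DB_To_BQ", "system.app.log.warn"))
```

The printed path matches the example request and would be sent with the POST method against the instance endpoint.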
Pipeline Status
The following metrics are useful to measure for each pipeline run:
system.program.completed.runs - Number of successful runs of the pipeline
system.program.failed.runs - Number of failed runs of the pipeline
system.program.killed.runs - Number of runs of the pipeline that were manually stopped
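As an illustration of how these three counters can be combined, the sketch below derives a total run count and a failure rate. The function name and the idea of alerting on a failure rate are assumptions for illustration; the inputs are the values read from the three metrics above for the same time range.

```python
def run_summary(completed, failed, killed):
    """Summarize run counts taken from the three run-status metrics."""
    total = completed + failed + killed
    # Avoid dividing by zero when the pipeline has not run in the time range.
    failure_rate = failed / total if total else 0.0
    return {"total": total, "failure_rate": failure_rate}
```

For example, run_summary(8, 2, 0) reports 10 total runs and a failure rate of 0.2.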
Pipeline Logs
Each pipeline run emits logs, and the number of messages at each log level (such as warnings and errors) is collected and stored in the metrics system. These counts give a quick indication of errors in a pipeline and can be correlated with failed runs to get an overview of the pipeline runs.
system.app.log.debug - Number of debug messages in a pipeline
system.app.log.error - Number of error messages in a pipeline
system.app.log.info - Number of info messages in a pipeline
system.app.log.warn - Number of warning messages in a pipeline
Note: In addition to monitoring the metrics above at the pipeline level, monitoring them at the run level is recommended.
Pipeline Execution
For each pipeline run, the duration of the run and the Dataproc node usage are computed and stored in the following metrics:
system.program.run.seconds - Time in seconds for a pipeline run
system.program.node.minutes - Node minutes for the Dataproc cluster
system.program.provisioning.delay.seconds - Time taken to provision a Dataproc cluster
Note: In addition to monitoring the metrics above at the pipeline level, monitoring them at the run level is recommended.
Note: The system.program.node.minutes metric works well for a dynamically created (ephemeral) Dataproc cluster, which is the default mode for running pipelines. If a static Dataproc cluster is configured, node minutes are computed only for a single node in the cluster, for the duration of the run.