Monitoring Instance and Pipelines


This document describes some of the key services and metrics to monitor for a Data Fusion instance and its pipelines.

Data Fusion System Health

Data Fusion has fine-grained services for each functional aspect of the system. These system services run on GKE pods and are self-healing. Any outage over a long period of time (> 5 minutes) will impact the functioning of the instance.

Use the following REST API call to get the status of all services:

GET /v3/system/services/status

The response body contains a JSON-formatted map of the system services and their statuses:

{
  "messaging.service": "OK",
  "metrics.processor": "OK",
  "appfabric": "OK",
  "runtime": "OK",
  "dataset.executor": "OK",
  "metadata.service": "OK",
  "metrics": "OK",
  "log.saver": "OK"
}


Pipeline Metrics

Data Fusion emits a number of metrics that can be used to monitor critical aspects of pipeline execution. The metrics can be fetched at several levels:

  • Namespace level

  • Pipeline (or App) level

  • Pipeline run level

The metrics endpoint reference can be found in the CDAP documentation. It is recommended to monitor metrics at the pipeline level. For example, to query the warnings emitted by a pipeline called Customer_DB_To_BQ (the metric system.app.log.warn) over the last day, use the following endpoint:

POST /v3/metrics/query?target=metric&tag=namespace:default&tag=app:Customer_DB_To_BQ&metric=system.app.log.warn&start=now-1d&end=now

Note: The tag app refers to the pipeline. Each pipeline corresponds to a CDAP application with the same name.
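
A minimal Python sketch of this query is shown below. CDAP_ENDPOINT and AUTH_TOKEN are the same placeholder assumptions as in the earlier sketch, and query_pipeline_metric is a hypothetical helper name that the later examples reuse; its extra_tags parameter is included so the same helper can add run-level tags later.

import requests

CDAP_ENDPOINT = "https://<data-fusion-instance-api-endpoint>"  # placeholder
AUTH_TOKEN = "<access-token>"                                  # placeholder

def query_pipeline_metric(namespace, pipeline, metric,
                          start="now-1d", end="now", extra_tags=()):
    """Query one metric for a pipeline (the app tag) over a time range."""
    params = [
        ("target", "metric"),                   # mirrors the endpoint shown above
        ("tag", f"namespace:{namespace}"),
        ("tag", f"app:{pipeline}"),             # the app tag is the pipeline name
        *[("tag", tag) for tag in extra_tags],  # e.g. a run-level tag
        ("metric", metric),
        ("start", start),
        ("end", end),
    ]
    resp = requests.post(
        f"{CDAP_ENDPOINT}/v3/metrics/query",
        params=params,
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Warnings from Customer_DB_To_BQ over the last day, as in the endpoint above.
print(query_pipeline_metric("default", "Customer_DB_To_BQ", "system.app.log.warn"))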

Pipeline Status

The following metrics are useful to monitor for each pipeline run (a query sketch follows the list):

  •   system.program.completed.runs - Number of successful pipeline runs

  •   system.program.failed.runs - Number of failed pipeline runs

  •   system.program.killed.runs - Number of pipeline runs that were manually stopped
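
The sketch below shows one way these counters might feed an alert. It reuses the hypothetical query_pipeline_metric helper from the previous sketch and assumes the metrics query response carries a "series" list whose "data" points hold time/value pairs; adjust the parsing if your response shape differs.

RUN_METRICS = (
    "system.program.completed.runs",
    "system.program.failed.runs",
    "system.program.killed.runs",
)

def run_counts(namespace, pipeline, start="now-1d", end="now"):
    """Sum each run-status metric for a pipeline over the time range."""
    counts = {}
    for metric in RUN_METRICS:
        result = query_pipeline_metric(namespace, pipeline, metric, start, end)
        counts[metric] = sum(
            point["value"]
            for series in result.get("series", [])  # assumed response shape
            for point in series.get("data", [])
        )
    return counts

# Example: flag a pipeline that had any failed runs in the last day.
counts = run_counts("default", "Customer_DB_To_BQ")
if counts["system.program.failed.runs"] > 0:
    print("Pipeline had failed runs:", counts)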

Pipeline Logs

Each pipeline run emits logs, and the number of warning and error messages is collected and stored in the metrics system. These counts give a quick indication of errors in a pipeline and can be correlated with failed runs to get an overview of pipeline health (a run-level query sketch follows the note below).

  • system.app.log.debug - Number of debug messages in a pipeline

  • system.app.log.error - Number of error messages in a pipeline

  • system.app.log.info - Number of info messages in a pipeline

  • system.app.log.warn - Number of warning messages in a pipeline

Note: For the metrics above, in addition to monitoring at the pipeline level, measuring at the run level is also recommended.
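
A run-level query might look like the following sketch, again reusing the hypothetical query_pipeline_metric helper above. The run:<run-id> tag name is an assumption; substitute a real run ID obtained from the pipeline's run history.

run_id = "<run-id>"  # placeholder for a real run ID
for metric in ("system.app.log.warn", "system.app.log.error"):
    result = query_pipeline_metric(
        "default", "Customer_DB_To_BQ", metric,
        extra_tags=[f"run:{run_id}"],  # assumed tag for scoping to a single run
    )
    print(metric, result)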

Pipeline Execution

For each pipeline run, the run duration and the Dataproc node minutes are computed and stored in the following metrics (an aggregation sketch follows the notes below):

  • system.program.run.seconds - Time in seconds for a pipeline run

  • system.program.node.minutes - Node minutes for the Dataproc cluster

  • system.program.provisioning.delay.seconds - Time taken to provision a Dataproc cluster

Note: For the metrics above, in addition to monitoring at the pipeline level, measuring at the run level is also recommended.

Note: The system.program.node.minutes metric works well for a dynamically created (ephemeral) Dataproc cluster, which is the default mode for running pipelines. If a static Dataproc cluster is configured, the node minutes are computed only for the duration of the run for a single node in the cluster.
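
For example, the total node minutes consumed by a pipeline over the last week could be aggregated as in the sketch below, under the same placeholder and response-shape assumptions as the earlier sketches.

result = query_pipeline_metric(
    "default", "Customer_DB_To_BQ", "system.program.node.minutes",
    start="now-7d", end="now",
)
total_node_minutes = sum(
    point["value"]
    for series in result.get("series", [])  # assumed response shape
    for point in series.get("data", [])
)
print("Node minutes over the last 7 days:", total_node_minutes)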