Logging integration

Introduction 

When running CDAP programs on Google Dataproc, anything logged by CDAP code on the Dataproc cluster is written only to files on that cluster. This makes viewing the logs difficult: a user must set up SSH access to the node(s) of the Dataproc cluster, log in remotely, locate the log files on the file system, and only then view them.

To make it easier for users to view logs from the Dataproc cluster, we will leverage Google Stackdriver. The solution has two parts:

  1. Logs will be pushed from the CDAP-controlled JVM on the Dataproc cluster to Stackdriver.

  2. Logs will be rendered to the CDAP UI by a CDAP service.

Approaches

Ingestion into Stackdriver

We will use a Logback appender for Stackdriver. This will involve the following steps:

  1. Package the google-cloud-logging-logback jar file with the Dataproc runtime extension module in CDAP.

  2. Copy the jar to the Dataproc cluster and add it to the classpath of the JVM that we launch.

  3. Configure Logback in the JVM that we launch to use the Stackdriver log appender. This can be done programmatically, similar to how it is done in LogAppenderInitializer.

  4. Implement a LoggingEnhancer to add labels to the logs that we emit. Labels may be useful when querying the logs, but may be unnecessary if Google Cloud log queries can filter on MDC values.
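The label selection in step 4 could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the CDAP implementation: the class name RunLabels, the MDC key names, and the "cdap_" label prefix are all hypothetical. In an actual LoggingEnhancer, each resulting entry would presumably be added to the log entry builder that the enhancer receives.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that selects the MDC entries to emit as Stackdriver labels. */
final class RunLabels {
  // Assumed MDC keys; the keys CDAP actually sets in the MDC may differ.
  private static final List<String> KEYS =
      Arrays.asList("namespace", "application", "program", "runId");

  private RunLabels() { }

  /** Returns the label map for one log event, skipping missing or empty values. */
  static Map<String, String> fromMdc(Map<String, String> mdc) {
    Map<String, String> labels = new LinkedHashMap<>();
    for (String key : KEYS) {
      String value = mdc.get(key);
      if (value != null && !value.isEmpty()) {
        // The "cdap_" prefix is an assumption, used to avoid clashing with other labels.
        labels.put("cdap_" + key, value);
      }
    }
    return Collections.unmodifiableMap(labels);
  }
}
```

Keeping the selection in a plain function makes it testable without a Logback or Stackdriver dependency. Only the listed keys are forwarded, so unrelated MDC entries do not become labels.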

Viewing the logs

There are several approaches for viewing the logs:

Approach #1: Use Client Java Library

Use the Stackdriver Logging Client libraries to fetch the logs from Stackdriver from a CDAP service.


Approach #2: Use Stackdriver REST API

Use the Stackdriver REST API to fetch the logs from Stackdriver within a CDAP service.

Pros:

  • More flexible than the Java library (the client library may not expose all of the API's functionality)

Cons:

  • More lines of code than using the Java library
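To give a sense of what the REST approach involves, the sketch below builds the JSON body for the Logging API v2 entries:list method, which is called with a POST request. It is a minimal illustration, not a proposed implementation: the class name, the cdap_runId label, and the hand-rolled escaping are assumptions, and authentication (an OAuth access token on the request) is not shown.

```java
/** Hypothetical builder for the body of a Logging API v2 entries:list request. */
final class EntriesListRequest {
  // REST endpoint of the entries.list method; requests are sent as POST.
  static final String ENDPOINT = "https://logging.googleapis.com/v2/entries:list";

  private EntriesListRequest() { }

  /** Builds a filter for one program run; the label name cdap_runId is an assumption. */
  static String filterForRun(String runId, String minSeverity) {
    return "labels.cdap_runId=\"" + escape(runId) + "\" AND severity>=" + minSeverity;
  }

  /** Builds the JSON request body, returning the newest entries first. */
  static String body(String projectId, String filter, int pageSize) {
    return "{\"resourceNames\":[\"projects/" + escape(projectId) + "\"],"
        + "\"filter\":\"" + escape(filter) + "\","
        + "\"orderBy\":\"timestamp desc\","
        + "\"pageSize\":" + pageSize + "}";
  }

  // Minimal escaping of backslashes and double quotes only; a real
  // implementation would use a JSON library instead.
  private static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }
}
```

A CDAP service would send this body to ENDPOINT and follow the page token in the response to fetch further pages. Hand-building the request like this is part of the extra code that this approach costs compared with the client library.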


Approach #3: Have Stackdriver export the logs to Cloud Storage, BigQuery, or Cloud Pub/Sub

Use Stackdriver’s Logs Export to have logs published to Cloud Storage, BigQuery, or Cloud Pub/Sub. In the case of Cloud Storage,

Pros:

  • More control over retention of logs

Cons:

  • Responsibility of retention now belongs to CDAP

  • More expensive when logs are not viewed often, because storage costs are incurred regardless of whether the logs are read


Approach #4: View the logs from Stackdriver UI

Use the Stackdriver UI to view the logs directly.

Pros:

  • Avoids reimplementing the functionality of a logs UI, such as filtering by timestamp or log level, text search, and an advanced filter syntax

Cons:

  • Not natively integrated into the CDAP UI; the user must leave the CDAP UI to view the logs


Open Questions 

  1. How will the CDAP system map the CDAP program’s run ID to a Stackdriver query?
    1. Profiles can currently be deleted, whereas viewing logs for a program run should still work.
    2. If the logs have expired (TTL) in Stackdriver, or if the profile has been deleted, what do we show in the UI and return from the REST API?
  2. How will logs emitted by the provisioner be consolidated?
  3. Metrics about program logs, such as the number of errors, are currently emitted while the program logs are processed. With the Stackdriver integration, there is no longer a process emitting such metrics. How will we emit them? One possible way is to have a log appender that emits these metrics from each container. We need to consider the performance impact of this.
  4. How can we keep the implementation generic enough to also support other logging integrations?
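For question 3, the per-container counting could be as small as the sketch below. It is an illustration under assumptions rather than a design decision: the class name LogLevelCounter is hypothetical, and the wiring into a Logback appender and into the CDAP metrics system is left out. An appender would call record once per log event with the event's level.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical thread-safe counter of log events per level. */
final class LogLevelCounter {
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

  /** Records one log event at the given level, for example "ERROR". */
  void record(String level) {
    counts.computeIfAbsent(level, key -> new LongAdder()).increment();
  }

  /** Returns the number of events recorded so far at the given level. */
  long count(String level) {
    LongAdder adder = counts.get(level);
    return adder == null ? 0L : adder.sum();
  }
}
```

LongAdder is used because many threads may log concurrently; it keeps contention on the hot path low, which bears on the performance concern raised in the question. How often the counts are published as metrics is a separate choice not covered here.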


API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path | Method | Description | Request Body | Response Code | Response






Deprecated REST API

Path | Method | Description



CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What is the impact on authorization, and how does the design take care of this aspect?

Impact on Infrastructure Outages 

System behavior (if applicable, document the impact of failures in downstream components such as YARN and HBase) and how the design takes care of these aspects

Test Scenarios

Test ID | Test Description | Expected Results












Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3


Future work