User-Friendly Logs Design Document

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

This document covers the design of user-friendly logs for CDAP. Today it is very cumbersome for a user to debug pipeline failures using the current logs.

This happens for several reasons:

  1. The logs are filled with a large number of messages that the user is not interested in. These include debug messages from CDAP platform code and from underlying dependencies.
  2. With so many messages shown, the information that matters to the user is often lost, rendering the logs unusable.
  3. Moreover, the interesting information is not clearly communicated in the logs. The logs are targeted at the developer rather than the user.
  4. Errors are wrapped multiple times and communicate neither the root cause of the problem nor how the user can recover from it.

This document covers the items required to address the above-mentioned shortcomings of the CDAP logs. It mainly covers the work items for Release 4.2.

Goals

  1. CDAP pipeline and program logs must help the user assess the progress of a program in the success scenario and debug it in case of failure.
  2. Provide guidelines for logs targeted at users and the log levels that should be used for them.

User Stories 

  • As a CDAP User I want to see crisp and concise logs clearly showing the progress of my Pipeline, Application or Program.

  • As a CDAP User I want to see error messages very clearly conveyed in the logs in case of any failures.

  • As a CDAP User I want error messages to help me recover from the problem reported.

Design

High-Level Design

Approach

The following major work items are involved:

  • Error Handling: Errors must be reported clearly to the user in a way that helps them recover from the problem. Work items:
    • Errors from the AM are sent to stderr/stdout, probably because the SLF4J bridge is not set up correctly. These errors must go through the logback framework so that they can be logged at the correct level with appropriate details. Ideally, no errors should go to stdout/stderr. This involves making sure the SLF4J bridge jars are included in the job jars.
    • The same happens for two other packages: Jetty and the Kafka producer.
    • ETL Lifecycle Errors: Propagate errors from the initialize, configure, and running stages of pipelines to the pipeline logs. The logs should include the plugin name and stage name for context.
    • Existing Errors: Check the usage of existing exceptions, specifically TransactionTimeoutException and RuntimeException.
    • Example Pipelines Errors
        
  • Context Based Logging: Logs today are tagged with a logging context that contains program run id details. In addition to these tags, more MDC context tags will be added, which the UI can use to filter the logs that a user would be interested in.
    • Lifecycle: Logs that represent the lifecycle of a program or a pipeline.
    • Error: In case of failures, the most interesting errors for a user must be tagged.
    • Other interesting information can be tagged using specific tags.

  • Program Logs Cleanup: This involves an overall cleanup of program logs. Today the logs are flooded with developer debug messages, which are of the least interest to the user. There are also several errors and warnings from the Hadoop system, often caused by missing or incorrect user configuration. These need to be cleaned up, and the logs have to be written so that they target the user and not the developer.

  • Guidelines for Future Development: Formulate a set of guidelines for logs that will be followed for future development. 
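
The error-handling and context-tagging items above can be illustrated with a small, self-contained sketch. It uses only the JDK and is not CDAP code: the class UserLogSketch, the tag key "category", its values, and the stage and plugin names are all hypothetical, since this document does not specify the actual tag names. In a real SLF4J/logback setup, the tags would be set through the MDC (for example, org.slf4j.MDC.put) rather than an explicit map. The sketch unwraps a multiply-wrapped exception to its root cause, builds an error message that names the stage and plugin, and filters events by tag the way a UI might.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class UserLogSketch {

    // Hypothetical tag key and values; the real MDC tag names are not defined in this document.
    static final String CATEGORY = "category";
    static final String LIFECYCLE = "lifecycle";
    static final String ERROR = "error";

    /** Minimal stand-in for a log event that carries MDC-style tags. */
    static final class LogEvent {
        final String message;
        final Map<String, String> tags;

        LogEvent(String message, Map<String, String> tags) {
            this.message = message;
            this.tags = tags;
        }
    }

    /** Walks the cause chain to the innermost exception, with a depth limit as a guard against cycles. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        int depth = 0;
        while (current.getCause() != null && depth++ < 100) {
            current = current.getCause();
        }
        return current;
    }

    /** Builds an error-tagged event that names the stage and plugin and reports only the root cause. */
    static LogEvent stageError(String stage, String plugin, Throwable t) {
        Throwable root = rootCause(t);
        String message = String.format("Stage '%s' (plugin '%s') failed: %s: %s",
            stage, plugin, root.getClass().getSimpleName(), root.getMessage());
        return new LogEvent(message, Collections.singletonMap(CATEGORY, ERROR));
    }

    /** Keeps only the events whose category tag matches, as a UI filter might. */
    static List<LogEvent> filterByCategory(List<LogEvent> events, String category) {
        return events.stream()
            .filter(e -> category.equals(e.tags.get(CATEGORY)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // An error wrapped several times, as described in the introduction.
        Throwable failure = new RuntimeException("Program failed",
            new IllegalStateException("Stage initialization failed",
                new IllegalArgumentException("Table 'purchases' does not exist")));

        List<LogEvent> events = Arrays.asList(
            new LogEvent("Pipeline started", Collections.singletonMap(CATEGORY, LIFECYCLE)),
            new LogEvent("Allocated container for worker", Collections.<String, String>emptyMap()),
            stageError("source1", "Table", failure));

        // Show only what an error-focused view would display.
        for (LogEvent e : filterByCategory(events, ERROR)) {
            System.out.println(e.message);
        }
    }
}
```

Untagged events, such as the container allocation message in the sketch, stay in the full log but drop out of the filtered views, which is the effect the tagging work item aims for.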

 

Guidelines

The Guidelines for the logs and the log levels are discussed here. 

 

Known Issues

This is the list of known errors that originate in the underlying system and cannot be fixed. In most cases, they cannot be suppressed using logback logger levels because they are logged at the ERROR level.

  1. Akka Association Error on CDH 5.5

    2017-05-09 00:06:36,869 - ERROR [sparkDriver-akka.actor.default-dispatcher-17:a.r.EndpointWriter@65] - AssociationError [akka.tcp://sparkDriver@10.250.0.23:43289] <- [akka.tcp://driverPropsFetcher@newcdh22886-1000.dev.continuuity.net:43137]: Error [Shut down address: akka.tcp://driverPropsFetcher@newcdh22886-1000.dev.continuuity.net:43137] [

    akka.remote.ShutDownAssociation: Shut down address: akka.tcp://driverPropsFetcher@newcdh22886-1000.dev.continuuity.net:43137

    Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.

    ]

    This is a known issue on CDH 5.5. More context: http://stackoverflow.com/questions/32627545/lots-of-error-errormonitor-associationerror-on-spark-startup

  2. YARN bug caused by a race between the NM and the RM

    2017-05-08 22:51:18,183 - ERROR [RMCommunicator Allocator:o.a.h.m.v.a.r.RMContainerAllocator@784] - Container complete event for unknown container container_1494282050386_0003_01_000009
    Bug: https://issues.apache.org/jira/browse/YARN-3535

     

 

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path | Method | Description | Response Code | Response
No new REST APIs are planned.

 

     

Deprecated REST API

Path | Method | Description
No deprecations are planned.

CLI Impact or Changes

  • No CLI impact

UI Impact or Changes

  • Several UI changes are involved to improve the readability of the logs. These changes are being addressed separately and are not covered in this document.

Security Impact 

There is no impact on Authorization or other security items.

Impact on Infrastructure Outages 

System behavior (if applicable, document the impact of downstream component failures such as YARN, HBase, etc.) and how the design takes care of these aspects.

Test Scenarios

Test ID | Test Description | Expected Results
   
   
   
   

Releases

Release 4.2.0

All the above work items will be addressed as part of Release 4.2.0.

Related Work

  • Work #1
  • Work #2
  • Work #3

 

Future work