User-Friendly Logs Design Document
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs Reviewed
- Release Priorities Assigned
- Test Cases Reviewed
- Blog Post
Introduction
This document covers the design for User-Friendly Logs for CDAP. Today it is very cumbersome for a user to debug failures in Pipelines using the current logs, for several reasons:
- The logs are flooded with entries the user is not interested in, including debug output from CDAP platform code and other underlying dependencies.
- With so many entries, the information that actually interests the user is often lost, rendering the logs unusable.
- Moreover, the interesting information is not clearly communicated: the logs are targeted at the developer rather than the user.
- Errors are wrapped multiple times and do not communicate the root cause of the problem or how the user can recover from it.
This document covers the items required to address the above-mentioned shortcomings in the CDAP logs. It mainly covers the work items for Release 4.2.
Goals
- CDAP Pipeline and Program logs must help the user assess the progress of the program in the success scenario and debug in case of failures.
- Provide guidelines for user-targeted logs and the log levels that should be used for them.
User Stories
- As a CDAP User, I want crisp and concise logs that clearly show the progress of my Pipeline, Application, or Program.
- As a CDAP User, I want error messages clearly conveyed in the logs in case of any failures.
- As a CDAP User, I want error messages that help me recover from the problem reported.
Design
High-Level Design
Approach
These are the major work items involved:
- Error Handling: Errors must be reported clearly to the user, in a way that helps them recover from the problem. Work items:
- Errors from the AM are sent to stderr/stdout. This is probably because the logging bridge is not set up correctly. These errors must come through the logback framework so that they can be logged at the correct level with appropriate details; ideally, no errors should go to stdout/stderr. This involves making sure the SLF4J bridge jars are included in the job jars.
- The same also happens for two other packages: Jetty and the Kafka producer.
- ETL Lifecycle Errors: Propagate errors from the Initialize, Configure, and Running stages of pipelines into the pipeline logs. The logs contain the plugin name and stage name for context.
- Existing Errors: Check usage, specifically of TransactionTimeoutException and RuntimeException.
- Example Pipeline Errors
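Where the stderr/stdout leakage comes from libraries that log through java.util.logging, installing the SLF4J bridge routes those records through logback instead. A minimal sketch, assuming `jul-to-slf4j` is on the job classpath (the exact AM-side wiring is not specified in this document):

```java
import java.util.logging.Logger;

import org.slf4j.bridge.SLF4JBridgeHandler;

public class BridgeSetup {

  public static void install() {
    // Remove the default JUL handlers that write directly to stderr,
    // then route all java.util.logging records through SLF4J/logback.
    SLF4JBridgeHandler.removeHandlersForRootLogger();
    SLF4JBridgeHandler.install();
  }

  public static void main(String[] args) {
    install();
    // This JUL call now flows through the SLF4J bridge instead of stderr.
    Logger.getLogger(BridgeSetup.class.getName()).info("routed through SLF4J");
  }
}
```

This only covers java.util.logging; Jetty and the Kafka producer may log through other facades, which would each need their own bridge jar on the classpath.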
- Context-Based Logging: Logs today are tagged with a logging context that contains program run ID details. In addition to these tags, more MDC context tags will be added, which the UI can use to filter the logs a user is interested in.
- Lifecycle: logs that represent the lifecycle of a program or pipeline.
- Error: in case of failures, the most interesting errors for a user must be tagged.
- Other interesting information can be tagged using specific tags.
- Program Logs Cleanup: an overall cleanup of program logs. Today the logs are flooded with developer debug output, which is of little interest to the user, along with errors/warnings from the Hadoop stack that are often caused by missing or incorrect user configuration. These need to be cleaned up, and the logs rewritten to target the user rather than the developer.
- Guidelines for Future Development: formulate a set of logging guidelines to be followed in future development.
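The context-based tagging above can be sketched with SLF4J's MDC. The tag key `event.type` and its values are hypothetical names for illustration; the actual MDC keys CDAP will use are not fixed in this document:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class TaggedLogging {
  private static final Logger LOG = LoggerFactory.getLogger(TaggedLogging.class);

  // Tag a user-facing lifecycle event so the UI can filter for it.
  public static void logLifecycle(String message) {
    MDC.put("event.type", "lifecycle");
    try {
      LOG.info(message);
    } finally {
      // Always clear the tag so it does not leak onto unrelated log lines.
      MDC.remove("event.type");
    }
  }

  // Tag the most interesting error for the user in a failure scenario.
  public static void logUserError(String message, Throwable cause) {
    MDC.put("event.type", "error");
    try {
      LOG.error(message, cause);
    } finally {
      MDC.remove("event.type");
    }
  }
}
```

MDC values travel with the logging event, so the backend (and hence the UI) can filter on them without parsing the message text.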
Guidelines
The guidelines for the logs and their log levels are discussed here.
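As an illustration of the kind of rule these guidelines could codify (the level assignments below are a proposal for illustration, not settled guidance): INFO for user-facing lifecycle progress, DEBUG for developer detail, ERROR with the root cause and recovery context.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StageRunner {
  private static final Logger LOG = LoggerFactory.getLogger(StageRunner.class);

  public void run(String stageName, String pluginName) {
    // INFO: user-facing lifecycle progress, with stage/plugin context.
    LOG.info("Starting stage '{}' (plugin '{}')", stageName, pluginName);
    try {
      // DEBUG: developer detail, hidden from the default user view.
      LOG.debug("Initializing plugin '{}' with resolved configuration", pluginName);
      // ... stage work would go here ...
      LOG.info("Stage '{}' completed", stageName);
    } catch (Exception e) {
      // ERROR: state the root cause with enough context to recover from it.
      LOG.error("Stage '{}' failed: {}", stageName, e.getMessage(), e);
      throw new RuntimeException("Stage '" + stageName + "' failed", e);
    }
  }
}
```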
Known Issues
This is the list of errors that come from the underlying system and are known issues that cannot be fixed. In most cases they cannot be suppressed using logback logger levels because they are logged at the ERROR level.
- Akka Association Error on CDH 5.5
2017-05-09 00:06:36,869 - ERROR [sparkDriver-akka.actor.default-dispatcher-17:a.r.EndpointWriter@65] - AssociationError [akka.tcp://sparkDriver@10.250.0.23:43289] <- [akka.tcp://driverPropsFetcher@newcdh22886-1000.dev.continuuity.net:43137]: Error [Shut down address: akka.tcp://driverPropsFetcher@newcdh22886-1000.dev.continuuity.net:43137] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://driverPropsFetcher@newcdh22886-1000.dev.continuuity.net:43137
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
This is a known issue on CDH 5.5. More context: http://stackoverflow.com/questions/32627545/lots-of-error-errormonitor-associationerror-on-spark-startup
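Since logger levels cannot silence these messages, one option is a logback `Filter` that denies known benign patterns before they reach the user-facing appender. A hypothetical sketch; the matched substring is illustrative and would need tuning per known issue:

```java
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.filter.Filter;
import ch.qos.logback.core.spi.FilterReply;

/**
 * Drops known, non-actionable ERROR events (e.g. the CDH 5.5 Akka
 * AssociationError above) so they do not pollute user-facing logs.
 */
public class KnownIssueFilter extends Filter<ILoggingEvent> {

  @Override
  public FilterReply decide(ILoggingEvent event) {
    String message = event.getFormattedMessage();
    // Deny events matching a known benign pattern; pass everything else on.
    if (message != null && message.contains("akka.remote.ShutDownAssociation")) {
      return FilterReply.DENY;
    }
    return FilterReply.NEUTRAL;
  }
}
```

The filter would be attached to the relevant appender in `logback.xml`; the trade-off is that a genuine error containing the same substring would also be hidden.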
- YARN bug caused by a race between the NM and RM
2017-05-08 22:51:18,183 - ERROR [RMCommunicator Allocator:o.a.h.m.v.a.r.RMContainerAllocator@784] - Container complete event for unknown container container_1494282050386_0003_01_000009
Bug: https://issues.apache.org/jira/browse/YARN-3535
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
No new REST APIs are planned |  |  |  |  |
Deprecated REST API
Path | Method | Description |
---|---|---|
No deprecations planned |  |  |
CLI Impact or Changes
- No CLI impact
UI Impact or Changes
- Several UI changes are involved to improve the readability of the logs. These changes are being addressed separately and are not covered in this document.
Security Impact
There is no impact on Authorization or other security items.
Impact on Infrastructure Outages
Document the system behavior on downstream component failures (YARN, HBase, etc.), if applicable, and how the design takes care of these aspects.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
 |  |  |
Releases
Release 4.2.0
All the above work items will be addressed as part of release 4.2.0.
Related Work
- Work #1
- Work #2
- Work #3