Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
This document covers the design for User Friendly Logs for CDAP. Today it is very cumbersome for a user to debug failures in Pipelines using the current logs.
There are several reasons for this:
- The logs are filled with a large number of entries that the user is not interested in. These include debug messages from CDAP platform code or other underlying dependencies.
- With so many log entries shown, the information that matters to the user is often lost, rendering the logs unusable.
- Moreover, the interesting information is not clearly communicated in the logs. The logs are targeted towards the developer rather than the user.
- Errors are wrapped multiple times and communicate neither the root cause of the problem nor how the user can recover from the error.
This document covers the items required to address the above-mentioned shortcomings of the CDAP logs. It mainly covers the work items for Release 4.2.
Goals
- CDAP Pipeline and Program logs must help the user assess the progress of a program when it succeeds and debug it when it fails.
- Provide guidelines for logs targeted at users and for the log level that should be used for them.
User Stories
- As a CDAP User, I want to see crisp and concise logs that clearly show the progress of my Pipeline, Application or Program.
- As a CDAP User, I want error messages to be conveyed very clearly in the logs in case of any failures.
- As a CDAP User, I want the error message to help me recover from the problem reported.
Design
High Level Design
Approach
There are three major work items involved:
- Error Handling: Errors must be reported clearly to the user, in a way that helps them recover from the problem. Work items:
  - Errors from AM are sent to stderr/stdout, probably because the SLF4J bridge is not set up correctly. These errors must come through the logback framework so that they can be logged at the correct level with appropriate details. Ideally, no errors should go to stdout/stderr. This involves making sure the SLF4J bridge jars are included in the job jars.
    - This also happens for two other packages: Jetty and the Kafka producer.
  - Exceptions are wrapped several times across multiple layers, so logging them produces very long stack traces. Work items:
    - CDAP will log only the root cause exception as an error.
    - This is easier for logs that are produced by CDAP itself, e.g. in Standalone. Error logs generated by the Hadoop system sometimes stringify the stack trace, and there is not much we can do about that in a clean way.
- Context Based Logging: Logs today are tagged with a logging context that contains program run id details. In addition to these tags, more context MDC tags will be added, which the UI can use to filter the logs that the user would be interested in.
  - Lifecycle: Logs that represent the lifecycle of a program or a pipeline.
  - Error: In case of failures, the most interesting errors for a user must be tagged.
  - Other interesting information can be tagged using specific tags.
- Program Logs Cleanup: This involves an overall cleanup of Program logs. Today the logs are flooded with developer debug logs, which are of the least interest to the user. There are also several errors and warnings from the Hadoop system, often caused by missing or incorrect user configuration. These need to be cleaned up, and the logs have to be written so that they target the user and not the developer.
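The "log only the root cause" item under Error Handling could be sketched roughly as below. This is a minimal illustration using only the JDK, not CDAP's actual implementation; the class name `RootCauses` and the method `rootCauseOf` are hypothetical names introduced for this example.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical helper (not part of CDAP): walks a wrapped exception
// chain down to the innermost cause so that only it needs to be logged.
final class RootCauses {
  private RootCauses() {}

  static Throwable rootCauseOf(Throwable t) {
    // Track visited throwables by identity to guard against cyclic cause chains.
    Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
    Throwable current = t;
    while (current.getCause() != null && seen.add(current)) {
      current = current.getCause();
    }
    return current;
  }
}
```

A caller would then pass `RootCauses.rootCauseOf(e)` to the logger at error level instead of the outermost wrapper `e`. Whether the wrapping layers should additionally be logged at a lower level is a separate decision that this sketch does not address.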
The Guidelines for the logs and the log levels are discussed here.
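To illustrate how the context tags described under Context Based Logging could drive filtering, here is a small self-contained model. It is only a sketch under stated assumptions: `TaggedLogs`, `LogEvent`, `withTag`, and the tag key `category` with the value `lifecycle` are all hypothetical and are not CDAP or logback names. In real SLF4J code the tags would typically be attached through `org.slf4j.MDC` rather than an explicit map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of tagged log events, for illustration only.
final class TaggedLogs {
  private TaggedLogs() {}

  // A log event carrying a message plus a copy of its context tags.
  static final class LogEvent {
    final String message;
    final Map<String, String> tags;

    LogEvent(String message, Map<String, String> tags) {
      this.message = message;
      this.tags = Collections.unmodifiableMap(new HashMap<>(tags));
    }
  }

  // Returns the events whose tag `key` equals `value`, preserving order.
  static List<LogEvent> withTag(List<LogEvent> events, String key, String value) {
    List<LogEvent> matches = new ArrayList<>();
    for (LogEvent event : events) {
      if (value.equals(event.tags.get(key))) {
        matches.add(event);
      }
    }
    return matches;
  }
}
```

With events tagged this way, a log viewer could call something like `TaggedLogs.withTag(events, "category", "lifecycle")` to show only the entries that describe the progress of a pipeline and hide untagged developer output.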
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
No new REST APIs are planned | | | | |
Deprecated REST API
Path | Method | Description |
---|---|---|
No deprecations are planned | | |
CLI Impact or Changes
- No CLI impact
UI Impact or Changes
- Several UI changes are involved to improve the readability of the logs. These changes are being addressed separately and are not covered in this document.
Security Impact
There is no impact on Authorization or other security items.
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of downstream component failures [YARN, HBase, etc.]) and how the design takes care of these aspects.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release 4.2.0
All of the above work items will be addressed as part of release 4.2.0.
Related Work
- Work #1
- Work #2
- Work #3