Logging

General Guidelines

Logs are primarily used to determine if a system is running smoothly or if there are problems. If there are problems, logs are expected to clearly describe what went wrong and give pointers for how to fix the problem. 

CDAP uses slf4j in its system and programs.

Descriptiveness

Individual log messages should be descriptive enough that they can stand on their own. Avoid logging messages that are only meaningful in the context of a previous message. A log message should have enough information that the user reading it knows what happened and what to do in response. For example, avoid messages like:


Program failed to start due to missing dataset.


This message does not tell the user which program failed or which dataset is missing, so the user cannot investigate further. A better message would be:


Program 'xyz' failed to start because dataset 'd123' does not exist.
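One way to keep such messages self-contained is to build them from the identifiers involved rather than from fixed text. The sketch below is illustrative only: the class and method names are hypothetical and not part of CDAP. It uses plain `String.format` so that it runs without extra dependencies.

```java
// Hypothetical helper, not a CDAP class: builds a message that names both
// the program and the dataset so the log line can stand on its own.
final class StartFailureMessages {

  static String missingDataset(String programName, String datasetName) {
    return String.format(
        "Program '%s' failed to start because dataset '%s' does not exist.",
        programName, datasetName);
  }

  public static void main(String[] args) {
    System.out.println(missingDataset("xyz", "d123"));
  }
}
```

With slf4j, the same message could be written with `{}` placeholders, for example `LOG.error("Program '{}' failed to start because dataset '{}' does not exist.", programName, datasetName)`, where `LOG` is assumed to be an slf4j `Logger` field.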


Message length

Avoid creating large log messages. Large messages are usually created when a collection of objects is logged. In these situations, limit the items that are logged. For example, instead of:


Profile 'abc' could not be deleted because it is being used by programs ns1:app1:program2, ns1:app1:program5, ns2:app3:program6, ...


log a message like:


Profile 'abc' could not be deleted because it is being used by ns1:app1:program2 and 32 other programs.
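A minimal sketch of how such a bounded message might be built is shown below. The class and method names are hypothetical and not part of CDAP; the sketch assumes the caller already has the list of programs that use the profile and names only the first one.

```java
import java.util.List;

// Hypothetical helper, not a CDAP class: names the first program that uses
// the profile and reports only a count for the remaining ones.
final class ProfileMessages {

  static String inUseMessage(String profile, List<String> programs) {
    if (programs.isEmpty()) {
      throw new IllegalArgumentException("programs must not be empty");
    }
    String first = programs.get(0);
    int others = programs.size() - 1;
    String users;
    if (others == 0) {
      users = first;
    } else if (others == 1) {
      users = first + " and 1 other program";
    } else {
      users = first + " and " + others + " other programs";
    }
    return String.format(
        "Profile '%s' could not be deleted because it is being used by %s.",
        profile, users);
  }

  public static void main(String[] args) {
    System.out.println(inUseMessage("abc",
        List.of("ns1:app1:program2", "ns1:app1:program5", "ns2:app3:program6")));
  }
}
```

The message length stays roughly constant no matter how many programs reference the profile. The full list could still be logged at a lower level if it is needed for debugging.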


Program Logs

Program logs are consumed by users who run programs, like pipelines or custom apps. When a pipeline fails, whether it be from mis-configuration, transient system errors, 

System Logs

System logs describe the CDAP system itself. They are generally consumed by a system administrator who is interested in making sure the platform is healthy. Errors in the system logs indicate s

Levels

It is important that the product is consistent about when to log at which level.


Error

An error message indicates something happened that could not be handled by the program or system. In code, it is usually because some exception was caught that the system could not recover from, so an operation failed and logged an error. In most circumstances it makes sense to log the stack trace to provide additional context and information. An error often indicates that user action is required in order to make progress. For example, if some system metadata could not be written because the underlying storage system is unavailable, it is appropriate to log an error. An admin should be able to read the message and perform further investigation on that underlying storage system.


An error indicates a problem with the system itself and should not be used to log incorrect use of an API. For example, if a user tries to call an API but provides invalid input, an error should not be logged. 

If an error is routinely ignored, it is a sign that it should really be logged at info or lower.

Warn

A warning indicates an unexpected situation occurred but the system was able to recover and move on. A warning often indicates that some action may be required. For example, if the system tries to clean up temporary files but fails for some reason, it is appropriate to log a warning. Those temporary files will not impact future operations, but an admin may want to take a look to make sure there are no disk issues. A stack trace is usually not useful.

If a warning is routinely ignored, it is a sign that it should really be logged at info or lower.

Info

Info messages are not real problems but often give information 

Debug


Trace