Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
Support Pluggable Log Saver Plugins in Log Saver for users to provide custom plugins to configure and store log messages and also improve log saver for other functionalities like versioning, impersonation, resilience, etc.
Existing Problems
Impersonation
Current implementation of LogSaver uses namespace directory to store logs which makes LogSaver dependent on impersonation. Impersonation could be avoided for logsaver if log files storage is not namespaced. For example, use a common root directory outside any CDAP namespace to store log files and use namespace-version-id to differentiate between namespaces. This would prohibit the users to look at their log files directly in HDFS as they are owned by "cdap". We need to consider if we have to provide options to make the "user" flexible for log files written by log saver plugins.
Entity Deletion and Recreation
LogSaver currently does not handle the case where a given entity is deleted and recreated with the same name. This is a common scenario on development clusters where developers can delete and recreate entity with the same name. There are couple of open issues present <link>
Pluggable Log Appenders
User might want to configure LogSaver to collect logs at different level and also write log messages at different destination. Having capability to add custom LogAppender will allow users to implement custom logic for logs in addition to existing CDAP LogSaver. For example, user might want to collect debug logs per application and store them on HDFS at application-dir/{debug.log} location.
DataStructures
TODO: Look into improving messageTable used for sorting log messages.
Resilience
Making LogSaver resilient to failure scenarios.
TODO: Add more details
Goals
- Provides a unified framework for processing log events read off from the transport (Kafka currently).
- Allows arbitrary number of processors to process the log events.
- Provides implementations for common processors (e.g write to HDFS, rotation, expiration... etc).
User Stories
- For each application, user wants to collect the application's logs into multiple logs files based on log level
- For each application, user wants to configure a location in HDFS to be used to store the collected logs.
- For each application, User wants the application log files stored in text format.
- For each application, User would wants to configure the RotationPolicy of log files.
- For each application, user wants to configure different layout for formatting log messages for the different log files generated.
- User wants to group log messages at application level and write multiple separate log files for each application. Example:
application-dir/{audit.log, metrics.log, debug.log}
- User wants to write these log files to a configurable path in HDFS.
- User wants to configure log saver extension using logback configuration.
- User also wants to be able to configure rolling policy for these log files similar to log-back.
Design
Principles
- Establish a clear role separation between the framework and individual processor
- Framework responsibilities
- Reading log events from the log transport
- Transport offset management
- Grouping, buffering and sorting of log events
- Configuration of log processors
- Defines contract and services between the framework and log processors
- Lifecycle management (instantiate, process, destroy)
- Batch events processing
- Impersonation
- Exception handling policies for exceptions raised by log processors
- Processor responsibilities
- Actual processing of events (mostly writing to somewhere)
- Files and metadata management
- Framework responsibilities
- Scalability
- Memory usage should be relatively constant regardless of the log events rate
- It can varies depending on individual event size (message length, stacktrace, etc.), but it shouldn't vary too much
- Memory usage can be proportional to number of log processors
- Memory usage can be proportional to number of program/application logs being processed by the same log process
- Scaling to multiple JVM processes should only be along with the number of concurrent programs/applications running
- Memory usage should be relatively constant regardless of the log events rate
- Transport neutral
- In distributed mode, Kafka is being used. We may switch to TMS in future.
- In SDK / unit-test mode, can be just an in memory queue, or using local TMS.
Log Processing Framework
As mentioned above, the log processing framework (the framework) is responsible for dispatching log events to individual appenders to process, as well as making CDAP services available for appenders.
TODO
Log Processor
Given that the Logback library is a widely used logging library for SLF4J and is also the logging library used by CDAP, we can simply have log processors implemented through Logback Appender
API, as well as configured through usual logback.xml
configuration. Here are the highlights for such design:
- Any number of log appenders can be configured through a logging framework specific
logback.xml
file (not the one used by CDAP nor for containers). - All appender configurations supported by
logback.xml
should be supported (e.g. appender-ref for delegation/wrapping). - Any existing Logback
Appender
implementations should be usable without any modification, as long as being properly configured in the runtime environment. - For
Appender
implementation that needs to gain access to CDAP services (e.g. dataset, transaction, metrics, impersonation... etc), a CDAPAppenderContext
will be provided. There will be new CDAP interface for theAppender
to optionally implemented in order to get hold of theAppenderContext
instance. - An optional interface for
Appender
to implement to get notified for beginning and end of a batch of log events.
CDAP Log Appender
There will be a log appender implementation for CDAP, replacing the current log-saver. The CDAP log appender is responsible for:
- Write log events into Avro files and maintaining metadata to those files.
- The metadata and the files generated are the only contract between the CDAP log appender and the log query handler.
- Policy based rolling of files / metadata
- Removing files that passed TTL, enforcing max total size/number of files.
- See Logback rolling policy for examples.
- Should implements the optional interfaces to get access to
AppenderContext
and log events batch start/end notifications- Performs file sync (hsync for HDFS files) on batch end
- Also persist processing state into metadata store. The processing state contains the transport offsets (Kafka start/end offsets or TMS messageIds) provided via the batch start/end methods. The appender may store other state information generated by itself.
- Interim failure in writing the state to metadata store shouldn't fail the call.
- Dataset service / HBase / TX service can have service interruption
- Under normal operation, once those services become available, the latest states will be persisted.
- The worst case scenario is having some duplicated log events in log file, but it's acceptable.
- E.g. the log processing JVM died before successfully persisting the latest state
Aspects to consider:
- Kafka Partitioning
- Sorting and Bucketing
- Checkpointing
- Resilience
- Classloader Isolation
Assumptions:
- Appenders are synchronous
Approach
Approach #1
Approach #2
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success 404 - When application is not available 500 - Any internal errors |
|
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What's the impact on Authorization and how does the design take care of this aspect
Impact on Infrastructure Outages
System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3