Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

Support Pluggable Log Saver Plugins in Log Saver for users to provide custom plugins to configure and store log messages and also improve log saver for other functionalities like versioning, impersonation, resilience, etc.

Existing Problems

Impersonation

Current implementation of LogSaver uses namespace directory to store logs which makes LogSaver dependent on impersonation. Impersonation could be avoided for logsaver if log files storage is not namespaced. For example, use a common root directory outside any CDAP namespace to store log files and use namespace-version-id to differentiate between namespaces. This would prohibit the users to look at their log files directly in HDFS as they are owned by "cdap". We need to consider if we have to provide options to make the "user" flexible for log files written by log saver plugins.

Entity Deletion and Recreation

LogSaver currently does not handle the case where a given entity is deleted and recreated with the same name. This is a common scenario on development clusters where developers can delete and recreate entity with the same name. There are couple of open issues present <link>

Pluggable Log Appenders

User might want to configure LogSaver to collect logs at different level and also write log messages at different destination. Having capability to add custom LogAppender will allow users to implement custom logic for logs in addition to existing CDAP LogSaver. For example, user might want to collect debug logs per application and store them on HDFS at application-dir/{debug.log} location.

DataStructures

TODO: Look into improving messageTable used for sorting log messages.

Resilience

Making LogSaver resilient to failure scenarios.

TODO: Add more details

Goals

Provides a unified framework for processing log events read off from the transport (Kafka currently).
Allows arbitrary number of processors to process the log events.
Provides implementations for common processors (e.g write to HDFS, rotation, expiration... etc).

User Stories

For each application, user wants to collect the application's logs into multiple logs files based on log level
For each application, user wants to configure a location in HDFS to be used to store the collected logs.
For each application, User wants the application log files stored in text format.
For each application, User would wants to configure the RotationPolicy of log files.
For each application, user wants to configure different layout for formatting log messages for the different log files generated.
User wants to group log messages at application level and write multiple separate log files for each application. Example: application-dir/{audit.log, metrics.log, debug.log}
User wants to write these log files to a configurable path in HDFS.
User wants to configure log saver extension using logback configuration.
User also wants to be able to configure rolling policy for these log files similar to log-back.

Design

Principles

Establish a clear role separation between the framework and individual processor
- Framework responsibilities
  - Reading log events from the log transport
  - Transport offset management
  - Grouping, buffering and sorting of log events
  - Configuration of log processors
  - Defines contract and services between the framework and log processors
    - Lifecycle management (instantiate, process, destroy)
    - Batch events processing
    - Impersonation
    - Exception handling policies for exceptions raised by log processors
- Processor responsibilities
  - Actual processing of events (mostly writing to somewhere)
  - Files and metadata management
Scalability
- Memory usage should be relatively constant regardless of the log events rate
  - It can varies depending on individual event size (message length, stacktrace, etc.), but it shouldn't vary too much
- Memory usage can be proportional to number of log processors
- Memory usage can be proportional to number of program/application logs being processed by the same log process
- Scaling to multiple JVM processes should only be along with the number of concurrent programs/applications running
Transport neutral
- In distributed mode, Kafka is being used. We may switch to TMS in future.
- In SDK / unit-test mode, can be just an in memory queue, or using local TMS.

Log Processing Framework

As mentioned above, the log processing framework (the framework) is responsible for dispatching log events to individual appenders to process, as well as making CDAP services available for appenders.

TODO

Log Processor

Given that the Logback library is a widely used logging library for SLF4J and is also the logging library used by CDAP, we can simply have log processors implemented through Logback Appender API, as well as configured through usual logback.xml configuration. Here are the highlights for such design:

Any number of log appenders can be configured through a logging framework specific logback.xml file (not the one used by CDAP nor for containers).
All appender configurations supported by logback.xml should be supported (e.g. appender-ref for delegation/wrapping).
Any existing Logback Appender implementations should be usable without any modification, as long as being properly configured in the runtime environment.
For Appender implementation that needs to gain access to CDAP services (e.g. dataset, transaction, metrics, impersonation... etc), a CDAP AppenderContext will be provided. There will be new CDAP interface for the Appender to optionally implemented in order to get hold of the AppenderContext instance.
An optional interface for Appender to implement to get notified for beginning and end of a batch of log events.

CDAP Log Appender

There will be a log appender implementation for CDAP, replacing the current log-saver. The CDAP log appender is responsible for:

Write log events into Avro files and maintaining metadata to those files.
- The metadata and the files generated are the only contract between the CDAP log appender and the log query handler.
Policy based rolling of files / metadata
- Removing files that passed TTL, enforcing max total size/number of files.
- See Logback rolling policy for examples.
Should implements the optional interfaces to get access to AppenderContext and log events batch start/end notifications
- Performs file sync (hsync for HDFS files) on batch end
- Also persist processing state into metadata store. The processing state contains the transport offsets (Kafka start/end offsets or TMS messageIds) provided via the batch start/end methods. The appender may store other state information generated by itself.
- Interim failure in writing the state to metadata store shouldn't fail the call.
  - Dataset service / HBase / TX service can have service interruption
  - Under normal operation, once those services become available, the latest states will be persisted.
  - The worst case scenario is having some duplicated log events in log file, but it's acceptable.
    - E.g. the log processing JVM died before successfully persisting the latest state

Aspects to consider:

Kafka Partitioning
Sorting and Bucketing
Checkpointing
Resilience
Classloader Isolation

Assumptions:

Appenders are synchronous

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path

Method

Description

Response Code

Response

/v3/apps/<app-id>

GET

Returns the application spec for a given application

200 - On success

404 - When application is not available

500 - Any internal errors

Deprecated REST API

Path	Method	Description
/v3/apps/<app-id>	GET	Returns the application spec for a given application

CLI Impact or Changes

Impact #1
Impact #2
Impact #3

UI Impact or Changes

Impact #1
Impact #2
Impact #3

Security Impact

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test ID	Test Description	Expected Results

Introduction

Existing Problems

Impersonation

Entity Deletion and Recreation

Pluggable Log Appenders

DataStructures

Resilience

Goals

User Stories

Design

Principles

Log Processing Framework

Log Processor

CDAP Log Appender

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

Deprecated Programmatic APIs

New REST APIs

Deprecated REST API

CLI Impact or Changes

UI Impact or Changes

Security Impact

Impact on Infrastructure Outages

Test Scenarios

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

Future work

Log Framework

Introduction

Existing Problems

Impersonation

Entity Deletion and Recreation

Pluggable Log Appenders

DataStructures

Resilience

Goals

User Stories

Design

Principles

Log Processing Framework

Log Processor

CDAP Log Appender

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

Deprecated Programmatic APIs

New REST APIs

Deprecated REST API

CLI Impact or Changes

UI Impact or Changes

Security Impact

Impact on Infrastructure Outages

Test Scenarios

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

Future work