Ingesting Data into CDAP using Apache Flume

Source Code Repository: Source code (and other resources) for this guide are available at the CDAP Guides GitHub repository.

Ingesting realtime log data into Hadoop for analysis is a common use case that Apache Flume handles well. In this guide, you will learn how to ingest data into CDAP with Apache Flume and process it in realtime.

What You Will Build

You will build a CDAP application that uses web logs aggregated by Flume to find page view counts. You will:

Configure Flume to ingest data into a CDAP Stream;
Build a realtime Flow to process the ingested web logs; and
Build a Service to serve the analysis results via HTTP.

What You Will Need

Let’s Build It!

The following sections will guide you through configuring and running Flume, and implementing an application from scratch. If you want to deploy and run the application right away, you can clone the sources from this GitHub repository. In that case, feel free to skip the following two sections and jump directly to the Build and Run Application section.

Application Design

Web logs are aggregated using Flume, which pushes the data to a webLogs Stream using a special StreamSink from the cdap-ingest library. The logs are then processed in realtime by a Flow that consumes data from the webLogs Stream and persists the computation results in a pageViewTable Dataset. The WebLogAnalyticsService makes the results stored in the pageViewTable Dataset accessible via HTTP.

First, we will build, deploy, and start the app. Once it is ready to accept and process data, we will configure Flume to push data into the Stream in realtime.

Application Implementation

The recommended way to build a CDAP application from scratch is to use a Maven project. Use this directory structure:

./pom.xml
./src/main/java/co/cask/cdap/guides/PageViewCounterFlowlet.java
./src/main/java/co/cask/cdap/guides/WebLogAnalyticsApplication.java
./src/main/java/co/cask/cdap/guides/WebLogAnalyticsFlow.java
./src/main/java/co/cask/cdap/guides/WebLogAnalyticsHandler.java

WebLogAnalyticsApplication declares that the application has a Stream, a Flow, a Service and uses a Dataset:

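package co.cask.cdap.guides;

import co.cask.cdap.api.app.AbstractApplication;
import co.cask.cdap.api.data.stream.Stream;
import co.cask.cdap.api.dataset.lib.KeyValueTable;

public class WebLogAnalyticsApplication extends AbstractApplication {

  @Override
  public void configure() {
    setName("WebLogAnalyticsApp");
    // Stream that Flume pushes the raw web log lines into
    addStream(new Stream("webLogs"));
    // Key-value Dataset holding one counter per page URL
    createDataset("pageViewTable", KeyValueTable.class);
    addFlow(new WebLogAnalyticsFlow());
    addService("WebLogAnalyticsService", new WebLogAnalyticsHandler());
  }
}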

The WebLogAnalyticsFlow makes use of the PageViewCounterFlowlet:

package co.cask.cdap.guides;

import co.cask.cdap.api.flow.AbstractFlow;

public class WebLogAnalyticsFlow extends AbstractFlow {

  @Override
  public void configure() {
    setName("WebLogAnalyticsFlow");
    setDescription("A flow that collects and performs web log analysis");
    addFlowlet("pageViewCounter", new PageViewCounterFlowlet());
    // Feed every event from the webLogs Stream to the pageViewCounter flowlet
    connectStream("webLogs", "pageViewCounter");
  }
}

The PageViewCounterFlowlet receives the log events from the webLogs Stream. It parses each log event, extracts the requested page URL, and increments the corresponding counter in the pageViewTable Dataset.

For example, given the following event:

the extracted requested page URL is https://accounts.example.org/signup. This will be used as a counter key in the pageViewTable Dataset.
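The parse-and-count step can be illustrated without any CDAP dependencies. The following is a minimal sketch, not the guide's PageViewCounterFlowlet: it assumes log lines in the Apache common or combined format, uses a plain in-memory map in place of the pageViewTable Dataset, and the class name, method names, and sample log line are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in for the flowlet's logic (not the guide's actual code).
class PageViewSketch {

  // Matches the quoted request part of a log line, e.g. "GET <url> HTTP/1.0",
  // and captures the requested URL. Assumes Apache common/combined log format.
  private static final Pattern REQUEST =
      Pattern.compile("\"[A-Z]+ (\\S+) HTTP/[0-9.]+\"");

  // In-memory substitute for the pageViewTable Dataset.
  private final Map<String, Long> pageViews = new HashMap<>();

  // Returns the requested URL, or null if the line has no recognizable request part.
  static String extractUrl(String logLine) {
    Matcher m = REQUEST.matcher(logLine);
    return m.find() ? m.group(1) : null;
  }

  // Parses one log event and increments the counter keyed by the requested URL.
  void process(String logLine) {
    String url = extractUrl(logLine);
    if (url != null) {
      pageViews.merge(url, 1L, Long::sum);
    }
  }

  long count(String url) {
    return pageViews.getOrDefault(url, 0L);
  }

  public static void main(String[] args) {
    // Hypothetical log line, made up for this sketch.
    String line = "192.168.0.1 - - [14/Jan/2014:08:40:43 -0400] "
        + "\"GET https://accounts.example.org/signup HTTP/1.0\" 200 809";
    PageViewSketch counter = new PageViewSketch();
    counter.process(line);
    counter.process(line);
    System.out.println(counter.count("https://accounts.example.org/signup")); // prints 2
  }
}
```

Lines that do not match the pattern are skipped rather than counted, which is one reasonable choice for malformed input; a real flowlet might log or meter them instead.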

WebLogAnalyticsHandler responds to an HTTP GET request at /views with a map of web pages to their page-view counts.
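As a rough, CDAP-free illustration of what such a response body might look like, the sketch below renders a page-to-count map as a JSON object. It is an assumption that the response is JSON; the class and method names are hypothetical, the counts are made up, and a real handler would read from the pageViewTable Dataset and use CDAP's HTTP service API rather than hand-built strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: shows the shape of the response only.
class PageViewsJson {

  // Escapes backslashes and double quotes so a URL can be embedded in a JSON string.
  // Minimal on purpose: control characters are not handled.
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  // Renders {"<page>":<count>,...} in the map's iteration order.
  static String toJson(Map<String, Long> pageViews) {
    StringBuilder sb = new StringBuilder("{");
    String sep = "";
    for (Map.Entry<String, Long> e : pageViews.entrySet()) {
      sb.append(sep).append(quote(e.getKey())).append(':').append(e.getValue());
      sep = ",";
    }
    return sb.append('}').toString();
  }

  public static void main(String[] args) {
    Map<String, Long> views = new LinkedHashMap<>();
    views.put("https://accounts.example.org/signup", 2L); // made-up counts
    views.put("/index.html", 5L);
    System.out.println(toJson(views));
    // prints {"https://accounts.example.org/signup":2,"/index.html":5}
  }
}
```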

Build and Run Application

The WebLogAnalyticsApp can be built and packaged using the Apache Maven command:

Note that the remaining commands assume that the cdap-cli.sh script is available on your PATH. If this is not the case, please add it:

If you haven't already started a standalone CDAP installation, start it with the command:

We can then deploy the application to a standalone CDAP installation and start the flow and service:

Once the flow has started, it is ready to receive the web logs from the stream. Now, let’s configure and start Flume to push web logs into the stream.

Ingest Data with Flume

In the provided sources for this guide, you can find an Apache web server’s access.log file that we will use as a source of data. If you have access to a live Apache web server’s access logs, you can use them instead.

In order to configure Apache Flume to push web logs to a CDAP Stream, you need to create a simple Flume flow which includes:

A Flume source that tails the access logs;
An in-memory channel; and
A Flume sink that sends log lines into the CDAP Stream.

In this example, we will configure the source to tail access.log and the sink to send data to the webLogs Stream.

Download Flume

You can download the Apache Flume distribution from the Apache Flume download page.
Once downloaded, extract the archive into <flume-base-dir>:

Configure Flume Flow

Download the CDAP Flume sink jar into your Flume installation:

The CDAP Flume sink requires a newer version of the Guava library than the one usually shipped with Flume. You need to replace the existing Flume Guava library with guava-17.0.jar:

Now, let’s configure the flow by creating the configuration file weblog-analysis.conf at <flume-base-dir>/conf with these contents:

Change <cdap-flume-ingest-guide-basedir> in the configuration file to point to the <cdap-flume-ingest-guide> directory. Alternatively, you can point it to /tmp/access.log, and create /tmp/access.log with these sample contents:

Run Flume Flow with Agent

To run a Flume flow, start an agent with the flow’s configuration:

Once the agent has started, it begins to push data to the CDAP Stream. The CDAP application, started earlier, processes the log events as soon as data is received. You can then query the computed page-view statistics.

Query Results

WebLogAnalyticsService exposes an HTTP endpoint for you to query the results of processing:
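A minimal Java sketch of such a query, assuming Java 11 or later. The service's base URL is deliberately left as a command-line argument because it depends on your CDAP installation; only the /views path comes from the handler description in this guide. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client for the /views endpoint of WebLogAnalyticsService.
class QueryPageViews {

  // Appends the /views path to the service base URL, tolerating a trailing slash.
  static URI viewsUri(String serviceBaseUrl) {
    String base = serviceBaseUrl.endsWith("/")
        ? serviceBaseUrl.substring(0, serviceBaseUrl.length() - 1)
        : serviceBaseUrl;
    return URI.create(base + "/views");
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("usage: java QueryPageViews <service-base-url>");
      return;
    }
    HttpRequest request = HttpRequest.newBuilder(viewsUri(args[0])).GET().build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    // Print the status code followed by the raw response body.
    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}
```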

Example output:

Extend This Example

To make this application more useful, you can extend it:

Find the top visited pages by maintaining the top pages in a Dataset and updating them from the PageViewCounterFlowlet; and
Calculate the bounce ratio of web pages, with batch processing.
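As a starting point for the first extension, here is a minimal sketch of selecting the most-viewed pages from a map of counts. It recomputes the ranking from the full map on each call, which is simpler than maintaining the top pages incrementally in a Dataset as suggested above; the class name, method name, and sample counts are hypothetical.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for ranking pages by view count.
class TopPages {

  // Returns up to n page URLs, most viewed first; ties are broken by URL for determinism.
  static List<String> topN(Map<String, Long> pageViews, int n) {
    Comparator<Map.Entry<String, Long>> byCountDesc = (a, b) -> {
      int c = Long.compare(b.getValue(), a.getValue());
      return c != 0 ? c : a.getKey().compareTo(b.getKey());
    };
    return pageViews.entrySet().stream()
        .sorted(byCountDesc)
        .limit(n)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Map<String, Long> views = new HashMap<>();
    views.put("/a", 3L); // made-up counts
    views.put("/b", 7L);
    views.put("/c", 5L);
    System.out.println(topN(views, 2)); // prints [/b, /c]
  }
}
```

Sorting the whole map is fine for small numbers of distinct pages; for a large site, a bounded heap or an incrementally updated Dataset would avoid the full sort.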

Share and Discuss!

Have a question? Discuss at the CDAP User Mailing List.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
