...

This guide takes you through building a simple CDAP application that ingests web logs, aggregates the request counts for different combinations of fields, and can then be queried for the request volume over a time period. From the results, you can gain insight into the traffic and the health of a web site. You will:

  • Use a Stream to ingest real-time log data;

  • Build a Flow to process log entries as they are received into multidimensional facts;

  • Use a Dataset to store the aggregated numbers; and

  • Build a Service to query the aggregated data across multiple dimensions.
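To make the idea of aggregating request counts over combinations of fields more concrete, here is a minimal plain-Python sketch. It does not use any CDAP API. The log entries, the field names (ip, response_status), and the helper functions aggregate and query are all hypothetical, and the sketch simply counts every subset of fields rather than a configured set of aggregations.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, already-parsed log entries: (epoch seconds, dimension values).
ENTRIES = [
    (1423375230, {"ip": "10.0.0.1", "response_status": "200"}),
    (1423375290, {"ip": "10.0.0.2", "response_status": "200"}),
    (1423378900, {"ip": "10.0.0.1", "response_status": "404"}),
]

RESOLUTION = 3600  # assumed bucket width in seconds (one hour)


def aggregate(entries, resolution=RESOLUTION):
    """Count requests per (time bucket, combination of dimension values)."""
    counts = Counter()
    for ts, dims in entries:
        bucket = ts - ts % resolution  # round down to the start of the bucket
        names = sorted(dims)
        # Every subset of dimensions, including the empty one (overall total).
        for size in range(len(names) + 1):
            for subset in combinations(names, size):
                key = (bucket, tuple((n, dims[n]) for n in subset))
                counts[key] += 1
    return counts


def query(counts, start, end, **dims):
    """Total requests in [start, end) for exactly the given dimension values."""
    wanted = tuple(sorted(dims.items()))
    return sum(
        value
        for (bucket, key), value in counts.items()
        if start <= bucket < end and key == wanted
    )


counts = aggregate(ENTRIES)
total = query(counts, 1423375200, 1423382400)
ok_only = query(counts, 1423375200, 1423382400, response_status="200")
```

Because the counts are precomputed per combination, a query reduces to summing the matching buckets instead of rescanning the raw log entries.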

...

Let’s Build It!

The following sections will guide you through building an application from scratch. If you are interested in deploying and running the application right away, you can clone its source code from this GitHub repository. In that case, feel free to skip the next two sections and jump right to the Build and Run Application section.

Application Design

For this guide, we assume that we are processing the logs of a web site, produced by an Apache web server. The data could be collected from multiple servers and then sent to our application over HTTP. There are a number of tools that can help you with the ingestion task. We’ll skip over the details of ingesting the data (as this is covered elsewhere) and instead focus on storing and retrieving the data.

...

First, we need a place to receive and process the events. CDAP provides a real-time stream processing system that is a great match for handling event streams. After first setting the application name, our WebAnalyticsApp adds a new Stream.

Then, the application configures a Cube dataset to compute and store aggregations for combinations of dimensions. Let’s take a closer look at the properties that are used to configure the Cube dataset:

...

Code Block
[
    {
        "measureName": "count",
        "dimensionValues": {},
        "timeValues": [
            {
                "timestamp": 1423375200,
                "value": 3
            },
            {
                "timestamp": 1423389600,
                "value": 1
            }
        ]
    }
]
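
A result like the one above can be post-processed with a few lines of client code. The following is a minimal sketch, not part of the application itself: it assumes that each timestamp is in seconds since the Unix epoch (the magnitude of the values suggests this), and the summarize helper is a hypothetical name introduced for illustration.

```python
import json
from datetime import datetime, timezone

# The sample response shown above, embedded as text.
RESPONSE = """
[
    {
        "measureName": "count",
        "dimensionValues": {},
        "timeValues": [
            {"timestamp": 1423375200, "value": 3},
            {"timestamp": 1423389600, "value": 1}
        ]
    }
]
"""


def summarize(response_text):
    """Return, per measure, the total and the points as (UTC datetime, value)."""
    summary = {}
    for series in json.loads(response_text):
        points = [
            # Assumption: timestamps are seconds since the Unix epoch.
            (datetime.fromtimestamp(p["timestamp"], tz=timezone.utc), p["value"])
            for p in series["timeValues"]
        ]
        summary[series["measureName"]] = {
            "total": sum(value for _, value in points),
            "points": points,
        }
    return summary


summary = summarize(RESPONSE)
for when, value in summary["count"]["points"]:
    print(when.isoformat(), value)
print("total:", summary["count"]["total"])
```

Under that assumption, the two data points fall four hours apart, and the total over the queried period is the sum of their values.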

Share and Discuss!

Have a question? Discuss at the CDAP User Mailing List.

License

Copyright © 2015 Cask Data, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

...