Overview
Data such as Cask website access logs, email data, and GitHub statistics are scattered across sources. The goal of this application is to aggregate the relevant data (such as accesses, stars, and commits) and generate useful statistics in one location.
Motivation
To be able to generate and display aggregates and trends in one central location, and to test the use of CDAP to accomplish these goals.
Requirements
The system processes CloudFront S3 bucket data, Apache access logs, or any beacon data in as near real time as possible to generate aggregates.
The system is able to support more than one site (www.cask.co, blog.cask.co, cdap.io, coopr.io).
Aggregates are generated within a single site (over time, we will also have aggregates across all sites).
Retrieval is optimized and should not incur any additional cost; the data retrieved should not be pulled multiple times.
Data should be processed without any data loss.
Aggregates to be generated:
Number of page views
Number of sessions
Number of downloads (CDAP, Coopr, Tigon, Tephra)
Number of unique users
Number of page views by URL
Bounce Rate
The system is extensible to generate more metrics as needed over time.
The statistics should be aggregated at different time intervals:
Hourly
Daily
Weekly
Monthly
Every 3 months
Every 1 year
The stats should have additional dimensions:
Country
Browser
Device
System should be able to process data and catch up in case of major outages.
System should have the ability to visualize metrics in the form of Dashboard Widgets: Line, Bar, Pie, Scatter, etc.
Dashboard and backend should support overlaying week-over-week, month-over-month, or year-over-year for any metric.
System should have the ability to generate multiple reports, specified by metrics and aggregation level.
System should have the ability to configure notifications based on constraints specified for metrics:
High-mark reached for CDAP downloads
Low-mark for page views
Notifications on weekly and daily changes
The system is highly available and the reports are accessible 24x7x365
The system should maintain historical data for about two years
System should provide capabilities to introspect data by allowing users to arbitrarily query the data using SQL
Assumptions
Able to install logstash on all of our internal servers.
Able to install cgi-scripts on our internal servers to host external web beacons.
Able to modify the source of all our external websites to include a beacon image.
Design
Architecture
Partitioning of the TimePartitionedFileSet
Each data source will be in its own TPFS instance
Github: “GithubTPFS”
Format: Avro Record with fields - ts, repo, stars, forks, watchers, pulls
Cube Name: “GithubCube”
Websites: “WebTPFS”
All websites will be aggregated together, but each web beacon will add a tag to its events to indicate which site they come from, so the Flow can write to the appropriate Cube
Stream Name: “SQSStream”
Cube Name: “WebCube”
S3: “S3TPFS”
Format: Avro Record with fields - ts, body, type
Will read from “repository.cask.co.logs/cloudfront/”
Cube Name: “S3Cube”
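For illustration, the record formats above could be declared with Avro's SchemaBuilder. This is a sketch only; the field types are assumptions, since the design above specifies field names but not types.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class RecordSchemas {
  // Sketch of the GithubTPFS record; types are assumed.
  public static final Schema GITHUB_RECORD = SchemaBuilder.record("GithubRecord")
    .fields()
    .requiredLong("ts")        // collection timestamp (epoch millis assumed)
    .requiredString("repo")    // repository name
    .requiredInt("stars")
    .requiredInt("forks")
    .requiredInt("watchers")
    .requiredInt("pulls")      // open pull requests
    .endRecord();

  // Sketch of the S3TPFS record.
  public static final Schema S3_RECORD = SchemaBuilder.record("S3Record")
    .fields()
    .requiredLong("ts")
    .requiredString("body")    // raw log line
    .requiredString("type")    // log type, e.g. "cloudfront"
    .endRecord();
}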
Batch Processing
Cloudfront S3
S3 Source Plugin
Extend app templates to read from S3 directly via hadoop-aws
User needs to specify:
Access Key ID
Secret Access Key
List of S3 paths user wants to read from
For each path:
the (regex) pattern of log events to parse
the (regex) pattern of filenames for filtering files already read
Note: The prototype filters by filename (for now), but in the future, we may filter files by modification date for a more general way of keeping track of which files we have read up to.
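As a sketch of this filename-based filtering, the plugin could match each S3 key against the user-supplied regex and skip any key it has already recorded as read; selectUnread below is a hypothetical helper, not part of the plugin API.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class FileFilter {
  // Returns the keys that match the configured pattern and have not been read yet.
  public static List<String> selectUnread(List<String> s3Keys, String filenameRegex,
                                          Set<String> alreadyRead) {
    Pattern pattern = Pattern.compile(filenameRegex);
    List<String> toRead = new ArrayList<>();
    for (String key : s3Keys) {
      if (pattern.matcher(key).matches() && !alreadyRead.contains(key)) {
        toRead.add(key);
      }
    }
    return toRead;
  }
}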
Processing Pipeline
Read events using the FileBatchSource plugin and a (regex) pattern to extract the necessary information from the logs (LogParser transform; see the sketch below):
URI
Timestamp
IP
Device
Browser
Adapter will output data into a PartitionedFileSet
Another MapReduce job will periodically read from the S3 partition in the Fileset, and update the Cube dataset.
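For illustration, a LogParser-style transform could pull these fields out of a combined-format access log line with a regex along the following lines. This is a sketch; the real pattern depends on the actual log format (CloudFront logs differ from Apache combined logs).

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogParser {
  // Apache combined log format: ip, identd, user, [timestamp], "request", status, size, "referrer", "user agent"
  private static final Pattern LOG_PATTERN = Pattern.compile(
    "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"\\S+ (\\S+) [^\"]*\" \\d{3} \\S+ \"[^\"]*\" \"([^\"]*)\"$");

  public static void main(String[] args) {
    String line = "127.0.0.1 - - [10/Jun/2015:13:55:36 -0700] "
      + "\"GET /downloads/cdap HTTP/1.1\" 200 2326 \"-\" \"Mozilla/5.0\"";
    Matcher m = LOG_PATTERN.matcher(line);
    if (m.matches()) {
      String ip = m.group(1);        // client IP
      String timestamp = m.group(2); // request timestamp
      String uri = m.group(3);       // requested URI
      String userAgent = m.group(4); // raw user agent; browser and device are derived from it
      System.out.printf("ip=%s ts=%s uri=%s ua=%s%n", ip, timestamp, uri, userAgent);
    }
  }
}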
Github Metrics
Use a Workflow Custom Action to run periodic RESTful calls to the Github API
http://docs.cdap.io/cdap/current/en/developers-manual/building-blocks/workflows.html
Issue: A Custom Action does not currently support writing to datasets; this can be implemented in the future (Resolved)
Results will be written into the GitHub partition of the Fileset.
A MapReduce job will periodically read from the GitHub partition of the Fileset, and update the Cube dataset.
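As a sketch of the periodic call, the Custom Action could hit GitHub's public repos endpoint. The repository name below is illustrative, and the naive regex extraction stands in for a proper JSON library.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GithubPoller {
  // Extracts an integer field such as "stargazers_count": 1234 from the repo JSON.
  private static int extract(String json, String field) {
    Matcher m = Pattern.compile("\"" + field + "\"\\s*:\\s*(\\d+)").matcher(json);
    return m.find() ? Integer.parseInt(m.group(1)) : -1;
  }

  public static void main(String[] args) throws Exception {
    URL url = new URL("https://api.github.com/repos/caskdata/cdap"); // illustrative repo
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/vnd.github.v3+json");

    StringBuilder json = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        json.append(line);
      }
    }

    long ts = System.currentTimeMillis();
    int stars = extract(json.toString(), "stargazers_count");
    int forks = extract(json.toString(), "forks_count");
    int watchers = extract(json.toString(), "subscribers_count");
    // These values would be written as one record into the GitHub partition of the Fileset.
    System.out.printf("ts=%d stars=%d forks=%d watchers=%d%n", ts, stars, forks, watchers);
  }
}

Note that open pull requests are not part of the repos payload and would require a separate API call.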
Real-time Processing
Web Beacons (JS)
Main idea: JavaScript on the page will send a message to Amazon SQS containing the necessary information.
<script src="https://sdk.amazonaws.com/js/aws-sdk-2.1.35.min.js"></script>
<script type="text/javascript">
  // Configure the SDK with (send-only) credentials for the queue.
  AWS.config.update({/* authentication info */});

  var QUEUE_URL = 'https://sqs.us-west-1.amazonaws.com/058529096359/caskaqueue';
  var sqs = new AWS.SQS({region: 'us-west-1'});

  // Beacon payload for this page view (fields shown are illustrative).
  var message = {
    page: document.location.href,
    referrer: document.referrer,
    ts: Date.now()
  };

  var params = {
    MessageBody: JSON.stringify(message),
    QueueUrl: QUEUE_URL
  };

  sqs.sendMessage(params, function(err, data) {
    if (err) console.error('Failed to send message:', err);
    else console.log(data);
  });
</script>
Our beacons need to convey page views:
Beacons will be statically embedded on every page, which will invoke scripts on internal web servers; logstash will then forward these events to streams/Kafka.
(Page clicks/downloads will be stored/processed through S3)
CGI variables: http://www.oreilly.com/openbook/cgi/ch02_02.html
User Privacy Concerns?
Processing Pipeline
JavaScript will send a message to Amazon SQS
ETLRealtime Adapter will read from SQS into the Stream
Once the data is in the stream, it will be consumed by two components:
A flow will read and parse events, and write into the Cube Dataset
A MapReduce job will periodically read from the stream to migrate the events into the central data lake.
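A sketch of the flow's parsing flowlet (the first component above), assuming the CDAP 3.x flow and Cube APIs; the Cube method names should be verified against the CDAP version in use, and the dimension values shown would in practice be parsed out of the beacon event.

import co.cask.cdap.api.annotation.ProcessInput;
import co.cask.cdap.api.annotation.UseDataSet;
import co.cask.cdap.api.dataset.lib.cube.Cube;
import co.cask.cdap.api.dataset.lib.cube.CubeFact;
import co.cask.cdap.api.dataset.lib.cube.MeasureType;
import co.cask.cdap.api.flow.flowlet.AbstractFlowlet;
import co.cask.cdap.api.flow.flowlet.StreamEvent;
import java.nio.charset.StandardCharsets;

public class ParseFlowlet extends AbstractFlowlet {

  @UseDataSet("WebCube")
  private Cube webCube;

  @ProcessInput
  public void process(StreamEvent event) {
    String body = StandardCharsets.UTF_8.decode(event.getBody()).toString();
    // Parsing of the beacon payload into site/country/browser/device is elided here.

    CubeFact fact = new CubeFact(System.currentTimeMillis() / 1000); // timestamp in seconds
    fact.addDimensionValue("site", "www.cask.co"); // placeholder values
    fact.addDimensionValue("country", "US");
    fact.addDimensionValue("browser", "Chrome");
    fact.addDimensionValue("device", "desktop");
    fact.addMeasurement("pageviews", MeasureType.COUNTER, 1L); // one page view per event
    webCube.add(fact);
  }
}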
User Interface
Analytics
Metric organization
Github
Metrics we will show:
Stars
Forks
Watchers
Pull Requests
S3 logs, web pages (www.cask.co, blog.cask.co, cdap.io, coopr.io)
Aggregates are generated within a single site (over time, we will also have aggregates across all sites)
Metrics we will show:
page views
sessions
downloads
unique users
bounce rate
Trends
UI will allow refining all metrics to different time granularities (Hourly, Daily, Weekly, Monthly, Every three months, Every year)
Visualize metrics in the form of dashboard widgets (Line, Bar, Pie, Scatter, etc.)
Dashboard and backend should support overlaying week-over-week, month-over-month or year-over-year for any metric
Start off with existing CDAP dashboard and extend UI to what we need
Use Shilpa’s app template to show our trends.
Also allow for raw querying of data through SQL commands
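For example, assuming CDAP exposes the web events dataset as a Hive table under its dataset_<name> convention (the table and column names below are assumptions), a raw query could look like:

-- Page views per site for one day; ts is assumed to be in epoch seconds.
SELECT site, COUNT(*) AS pageviews
FROM dataset_webtpfs
WHERE ts >= 1433116800 AND ts < 1433203200
GROUP BY site
ORDER BY pageviews DESC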
Reports
Generate multiple reports: specified by metrics and aggregation level
Export data into PDF/Excel available for download in UI
Alerts
Allow the user to specify threshold values for metrics that will trigger email alerts:
High-mark reached for CDAP downloads
Low-mark for page views
Notifications on weekly and daily changes
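A minimal sketch of how such a threshold check could be evaluated against the latest aggregate values; Alert, check, and sendEmail below are hypothetical names, since the notification mechanism is not specified here.

public class AlertChecker {
  // A user-configured alert: fire when the metric crosses the threshold.
  static class Alert {
    final String metric;    // e.g. "downloads.cdap"
    final long threshold;
    final boolean highMark; // true: alert at or above threshold; false: at or below
    final String email;

    Alert(String metric, long threshold, boolean highMark, String email) {
      this.metric = metric;
      this.threshold = threshold;
      this.highMark = highMark;
      this.email = email;
    }
  }

  // Evaluates one alert against the latest aggregate value for its metric.
  static void check(Alert alert, long currentValue) {
    boolean fired = alert.highMark ? currentValue >= alert.threshold
                                   : currentValue <= alert.threshold;
    if (fired) {
      sendEmail(alert.email, "Alert for " + alert.metric + ": value " + currentValue
        + " crossed threshold " + alert.threshold);
    }
  }

  // Placeholder; in practice this would go through an SMTP library or notification service.
  static void sendEmail(String to, String message) {
    System.out.println("to=" + to + " message=" + message);
  }
}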
Caching
Possible caching of recent queries for optimization.
Fault Tolerance
Web Beacons
Web beacons will invoke the remote script, which will write information to an internal web server; logstash will then forward events from there
If the beacon fails to invoke the remote script (proxy/network issues), we may lose information, but in that case the user did not load the entire web page anyway
Reading from S3
We keep track of which files we have read up to
May also need to keep track of specific files that have failed, so we can re-read them into the system
Files are read in batch processing, so fault tolerance is easier to maintain than in real-time processing
GitHub
Only periodic calls to the GitHub API are made; no data processing is done
If the GitHub servers cannot be reached, we simply lose a data point for that interval; not a big issue
Data lake (PartitionedFileSet), Dataset Sinks
Main concern is with hardware
Data is saved to disk, so it should be retained even on failure
Need some form of replication so data is not lost if a server fails
Should have enough storage to maintain historical data for about 2 years
Needs to be highly available, so it can continue to serve data and the UI can access the information
Hardware
Work with DevOps to make sure the hardware is configured to support the application correctly
Replication, HA, memory (see Fault Tolerance - Dataset Sinks above)
Need additional hosts for web servers (beacons), redirect servers, proxy/load balancers
Determine how many new servers are necessary, if any
Deployment
Two options
Script that curls configuration files and application JARs to the CDAP framework
Pros
Can use ETL plugins, simpler to deploy templates as only JSON configs are needed (do not need to duplicate code)
Can also automate the starting/stopping of certain components with scripts
Cons
Portability issues
Write a single app that encapsulates all components
Pros
Easy to deploy as all components will be in a single JAR
Cons
Cannot use ETL
Lots of duplicate code within the app; if components need to be updated or fixed in the future, it will be harder to maintain
Starting/stopping certain components will still require scripts, and/or the user will need to do it manually through the UI