Overview
Data such as Cask website access logs, email data, GitHub statistics, etc. are scattered across sources. Our application goal is to aggregate relevant data (such as accesses, stars, and commits) and then generate useful statistics in one location.
Motivation
To be able to generate and display aggregates and trends in one central location, and to test the use of CDAP to accomplish these goals.
Requirements
The system processes Cloudfront S3 bucket data, Apache access log or any beacon data in as near real-time as possible to generate aggregates.
This system is able to support more than one site (www.cask.co, blog.cask.co, cdap.io, coopr.io).
Aggregates are generated within a single site (over a period, we will have aggregates across all sites).
Retrieval is optimized and should not incur any additional cost—meaning the data retrieved should not be pulled multiple times.
Data should be processed without any data loss.
Aggregates to be generated:
Number of page views
Number of sessions
Number of downloads (CDAP, Coopr, Tigon, Tephra)
Number of unique users
Number of pageviews by url
Bounce Rate
The system is extendable to generate more metrics if needed over a period of time.
The statistics should be aggregated at different time intervals:
Hourly
Daily
Weekly
Monthly
Every 3 months
Every 1 year
The stats should have additional dimensions:
Country
Browser
Device
System should be able to process and catch-up in case of major outages.
System should have the ability to visualize metrics in the form of Dashboard Widgets: Line, Bar, Pie, Scatter, etc.
Dashboard and backend should support overlaying week-over-week, month-over-month, or year-over-year for any metric.
System should have the ability to generate multiple reports, specified by metrics and aggregation level.
System should have the ability to configure notifications based on constraints specified for metrics:
High-mark reached for CDAP downloads
Low-mark for page views
Notifications on weekly and daily changes
The system is highly-available and the reports are available 24x7x365
The system should maintain historical data for about two years
System should provide capabilities to introspect data by allowing users to arbitrarily query the data using SQL
Assumptions
Able to install logstash on all of our internal servers.
Able to install cgi-scripts on our internal servers to host external web beacons.
Able to modify the source of all our external websites to include a beacon image.
Design
Architecture
Partitioning of TimePartitionedFileset
Each data source will be in its own TPFS instance
Github: “GithubTPFS”
Format: Avro Record with fields - ts, repo, stars, forks, watchers, pulls
Cube Name: “GithubCube”
Websites: “WebTPFS”
All websites will be aggregated together, but each web beacon will add a tag to each event to differentiate between which sites the events come from, so the Flow can write to the appropriate Cube
Stream Name: “SQSStream”
Cube Name: “WebCube”
S3: “S3TPFS”
Format: Avro Record with fields - ts, body, type
Will read from S3 repository
Cube Name: “S3Cube”
Batch Processing
Cloudfront S3
S3 Source Plugin
Extend app templates to read from S3 directly via hadoop-aws
User needs to specify:
Access Key ID
Secret Access Key
List of S3 paths user wants to read from
For each path:
the (regex) pattern of log events to parse
the (regex) pattern of filenames for filtering files already read
Note: The prototype filters by filename (for now), but in the future, we may filter files by modification date for a more general way of keeping track of which files we have read up to.
Processing Pipeline
Read events using FileBatchSource plugin and (regex) pattern of events to extract necessary information from logs (LogParser transform):
URI
Timestamp
IP
Device
Browser
Adapter will output data into a PartitionedFileSet
Another MapReduce job will periodically read from the S3 partition in the Fileset, and update the Cube dataset.
Github Metrics
Use a Workflow Custom Action to run periodic RESTful calls to the Github API
Results will be written into the GitHub partition of the Fileset.
A MapReduce job will periodically read from the GitHub partition of the Fileset, and update the Cube dataset.
Real-time Processing
Web Beacons (JS)
Main idea: JavaScript on the page will send a message to Amazon SQS containing the necessary information.
<script src="https://sdk.amazonaws.com/js/aws-sdk-2.1.35.min.js"></script>
<script type="text/javascript">
AWS.config.update(//authentication info);
var QUEUE_URL = 'https://sqs.us-west-1.amazonaws.com/058529096359/caskaqueue';
var sqs = new AWS.SQS({region : 'us-west-1'});
var params = {
MessageBody: JSON.stringify('message''),
QueueUrl: QUEUE_URL
};
sqs.sendMessage(params, function(err,data){
if(err) console.log('error:',"Fail Send Message" + err);
else console.log(data);
});
</script>
Our beacons need to convey page views:
Beacons will be statically embedded on every page, which will invoke scripts on internal web servers; logstash will then forward these events to streams/Kafka.
(Page clicks/downloads will be stored/processed through S3)
CGI variables: http://www.oreilly.com/openbook/cgi/ch02_02.html
User Privacy Concerns?
Processing Pipeline
Javascript will send message to Amazon SQS
ETLRealtime Adapter will read from SQS into the Stream
Once the data is in the streams, the data will be fetched with two components:
A flow will read and parse events, and write into the Cube Dataset
A MapReduce job will periodically read from the stream to migrate the events into the central data lake.
User Interface
Analytics
Metric organization
Github
Metrics we will show:
Stars
Forks
Watchers
Pull Requests
S3 logs, web pages (www.cask.co, blog.cask.co, cdap.io, coopr.io)
Aggregates are generated within a single site (over a period we will have aggregate across all sites)
Metrics we will show:
page views
sessions
downloads
unique users
bounce rate
Trends
UI will allow refining all metrics to different time granularities (Hourly, Daily, Weekly, Monthly, Every three months, Every year)
Visualize metrics in the form of Dashboard (Widgets - Line, Bar, Pie, Scatter, ...)
Dashboard and backend should support overlaying week-over-week, month-over-month or year-over-year for any metric
Start off with existing CDAP dashboard and extend UI to what we need
Use Shilpa’s app template to show our trends.
Also allow for raw querying of data through SQL commands
Reports
Generate multiple reports: specified by metrics and aggregation level
Export data into PDF/Excel available for download in UI
Alerts
Allow user to specify some threshold values for metrics that will alert by email
High-mark reached for CDAP download
Low-mark for page views
Notifications on weekly and daily changes
Caching
Possible caching of recent queries for optimization.
Fault Tolerance
Web Beacons
Web beacons will invoke the remote script, which will write information to internal web server, logstash will then forward from there
If the beacon fails to invoke the remote script (proxy/network issues), we may lose information, but user did not load entire webpage anyways
Reading from S3
We keep track of which files we have read up to
May also need to keep track of specific files that have failed, so we can re-read into our system
Files are read in batch processing, so fault tolerance is easier to maintain than real-time
GitHub
Only periodic calls to the Github API are made, no data processing is done
If Github servers are unable to be reached, then we just lose a data point at that time, not a big issue
Data lake (PartitionedFileSet), Dataset Sinks
Main concern is with hardware
Data is saved to disk, so should retain data even on failure
Need some sort of replication, so data is not lost if server fails
Should have enough memory to maintain historical data for about 2 years
Needs to be highly available, so should be able to feed data, UI will be able to access information
Hardware
Work with Dev Ops to make sure hardware is configured to support application correctly
Replication, HA, memory (see Fault Tolerance - Dataset Sinks above)
Need additional hosts for web servers (beacons), redirect servers, proxy/load balancers
Determine how many new servers are necessary, if any
Deployment
Two options
Script that curls configuration files and application JARs to the CDAP framework
Pros
Can use ETL plugins, simpler to deploy templates as only JSON configs are needed (do not need to duplicate code)
Can also automate the starting/stopping of certain components with scripts
Cons
Portability issues
Write a single app that encapsulates all components
Pros
Easy to deploy as all components will be in a single JAR
Cons
Cannot use ETL
Lots of duplicate code within app. If need to update/fix components in the future, will be harder to maintain
To start/stop certain components, will still need scripts and/or user will need to go into UI to manually do it