Overview


Data such as Cask website access logs, email data, and GitHub statistics are scattered across multiple sources. The goal of this application is to aggregate the relevant data (such as accesses, stars, and commits) and then generate useful statistics in one location.

Motivation


To be able to generate and display aggregates and trends in one central location, and to test the use of CDAP to accomplish these goals.

Requirements

  • The system processes Cloudfront S3 bucket data, Apache access logs, or any beacon data in as near real-time as possible to generate aggregates.

  • This system is able to support more than one site (www.cask.co, blog.cask.co, cdap.io, coopr.io).

  • Aggregates are generated within a single site (over a period, we will have aggregates across all sites).

  • Retrieval is optimized and should not incur any additional cost; that is, the same data should not be pulled multiple times.

  • Data should be processed without any data loss.

  • Aggregates to be generated:

    • Number of page views

    • Number of sessions

    • Number of downloads (CDAP, Coopr, Tigon, Tephra)

    • Number of unique users

    • Number of page views by URL

    • Bounce Rate

  • The system is extendable to generate more metrics over time if needed.

  • The statistics should be aggregated at different time intervals:

    • Hourly

    • Daily

    • Weekly

    • Monthly

    • Every 3 months

    • Every 1 year

  • The stats should have additional dimensions:

    • Country

    • Browser

    • Device

  • System should be able to process and catch up in case of major outages.

  • System should have the ability to visualize metrics in the form of Dashboard Widgets: Line, Bar, Pie, Scatter, etc.

  • Dashboard and backend should support overlaying week-over-week, month-over-month, or year-over-year for any metric.

  • System should have the ability to generate multiple reports, specified by metrics and aggregation level.

  • System should have the ability to configure notifications based on constraints specified for metrics:

    • High-mark reached for CDAP downloads

    • Low-mark for page views

    • Notifications on weekly and daily changes

  • The system is highly available and the reports are accessible 24x7x365.

  • The system should maintain historical data for about two years

  • System should provide capabilities to introspect data by allowing users to arbitrarily query the data using SQL

Assumptions

  • Able to install logstash on all of our internal servers.

  • Able to install cgi-scripts on our internal servers to host external web beacons.

  • Able to modify the source of all our external websites to include a beacon image.

Design


Architecture



  • Partitioning of TimePartitionedFileset

    • Each data source will be in its own TPFS instance

      • Github: “GithubTPFS”

        • Format: Avro Record with fields - ts, repo, stars, forks, watchers, pulls

        • Cube Name: “GithubCube”

      • Websites: “WebTPFS”

        • All websites will be aggregated together; each web beacon will tag its events with the originating site, so the Flow can write to the appropriate Cube

        • Stream Name: “SQSStream”

        • Cube Name: “WebCube”

      • S3: “S3TPFS”

        • Format: Avro Record with fields - ts, body, type

        • Will read from S3 repository

        • Cube Name: “S3Cube”
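As a sketch, the GithubTPFS Avro record described above could be declared with a schema along these lines (the field names come from the list above; the record name and field types are assumptions, not a fixed design decision):

```javascript
// Avro schema (expressed as a JS object) for the GithubTPFS record.
var githubRecordSchema = {
    type: 'record',
    name: 'GithubStats',        // illustrative name
    fields: [
        { name: 'ts',       type: 'long' },   // event timestamp
        { name: 'repo',     type: 'string' }, // repository name
        { name: 'stars',    type: 'int' },
        { name: 'forks',    type: 'int' },
        { name: 'watchers', type: 'int' },
        { name: 'pulls',    type: 'int' }     // pull requests
    ]
};
```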

Batch Processing 

Cloudfront S3

  • S3 Source Plugin

    • Extend app templates to read from S3 directly via hadoop-aws

    • User needs to specify:

      • Access Key ID

      • Secret Access Key

      • List of S3 paths user wants to read from

      • For each path:

        • the (regex) pattern of log events to parse

        • the (regex) pattern of filenames for filtering files already read

        • Note: The prototype filters by filename (for now), but in the future, we may filter files by modification date for a more general way of keeping track of which files we have read up to.

  • Processing Pipeline

    • Read events using FileBatchSource plugin and (regex) pattern of events to extract necessary information from logs (LogParser transform):

      • URI

      • Timestamp

      • IP

      • Device

      • Browser

    • Adapter will output data into a PartitionedFileSet

    • Another MapReduce job will periodically read from the S3 partition in the Fileset, and update the Cube dataset.
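The LogParser transform above boils down to applying a configured regex to each log line. As an illustration only, this sketch parses the Apache combined log format; per the requirements, the actual pattern would be supplied per S3 path:

```javascript
// Parse one Apache combined-format log line into the fields the pipeline
// needs (IP, timestamp, URI, user agent; browser/device would be derived
// from the user agent in a later step).
var COMBINED = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"$/;

function parseLogLine(line) {
    var m = COMBINED.exec(line);
    if (m === null) return null; // unparseable line: count and skip
    return {
        ip: m[1],
        timestamp: m[2],
        method: m[3],
        uri: m[4],
        status: Number(m[5]),
        userAgent: m[6]
    };
}
```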

Github Metrics


Real-time Processing

Web Beacons (JS)

  • Main idea: JavaScript on the page will send a message to Amazon SQS containing the necessary information.


<script src="https://sdk.amazonaws.com/js/aws-sdk-2.1.35.min.js"></script>
<script type="text/javascript">
AWS.config.update({ /* authentication info */ });
var QUEUE_URL = 'https://sqs.us-west-1.amazonaws.com/058529096359/caskaqueue';
var sqs = new AWS.SQS({region: 'us-west-1'});
var params = {
    MessageBody: JSON.stringify('message'),
    QueueUrl: QUEUE_URL
};
sqs.sendMessage(params, function (err, data) {
    if (err) console.log('error:', 'Fail Send Message' + err);
    else     console.log(data);
});
</script>


  • Our beacons need to convey page views:

    • Beacons will be statically embedded on every page, which will invoke scripts on internal web servers; logstash will then forward these events to streams/Kafka.

    • (Page clicks/downloads will be stored/processed through S3)

  • CGI variables: http://www.oreilly.com/openbook/cgi/ch02_02.html

  • User Privacy Concerns?

  • Processing Pipeline

    • Javascript will send message to Amazon SQS

    • ETLRealtime Adapter will read from SQS into the Stream

    • Once the data is in the stream, it will be consumed by two components:

      • A flow will read and parse events, and write into the Cube Dataset

      • A MapReduce job will periodically read from the stream to migrate the events into the central data lake.
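A rough sketch of the flow's parse step: turning a raw beacon message from SQSStream into the kind of record that would be written to WebCube. The field names and record shape here are assumptions for illustration, not a fixed schema:

```javascript
// Convert a raw beacon message (JSON string from SQSStream) into a
// Cube-style fact: a timestamp, a measure, and a set of dimensions.
function toCubeFact(messageBody) {
    var evt = JSON.parse(messageBody);
    return {
        ts: evt.ts,
        measure: { name: 'pageviews', value: 1 },
        dimensions: {
            site:    evt.site,    // beacon-added tag, e.g. 'www.cask.co'
            url:     evt.url,
            country: evt.country,
            browser: evt.browser,
            device:  evt.device
        }
    };
}
```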

User Interface


Analytics

  • Metric organization

    • Github

      • Metrics we will show:

        • Stars

        • Forks

        • Watchers

        • Pull Requests

    • S3 logs, web pages (www.cask.co, blog.cask.co, cdap.io, coopr.io)

      • Aggregates are generated within a single site (over a period we will have aggregates across all sites)

      • Metrics we will show:

        • page views

        • sessions

        • downloads

        • unique users

        • bounce rate
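Of the metrics listed, bounce rate is the only derived one. A minimal sketch of the computation, assuming a session is represented as its list of page views:

```javascript
// Bounce rate = fraction of sessions that viewed exactly one page.
function bounceRate(sessions) {
    if (sessions.length === 0) return 0;
    var bounces = sessions.filter(function (pages) {
        return pages.length === 1;
    }).length;
    return bounces / sessions.length;
}
```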


Trends

  • UI will allow refining all metrics to different time granularities (Hourly, Daily, Weekly, Monthly, Every three months, Every year)

  • Visualize metrics in the form of Dashboard (Widgets - Line, Bar, Pie, Scatter, ...)

  • Dashboard and backend should support overlaying week-over-week, month-over-month or year-over-year for any metric

  • Start off with existing CDAP dashboard and extend UI to what we need

  • Use Shilpa’s app template to show our trends.

  • Also allow for raw querying of data through SQL commands
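The time granularities above can be sketched as a timestamp-bucketing helper. This is illustrative only; the function name is hypothetical, and it uses UTC calendar boundaries (weeks starting Sunday), which is an assumption:

```javascript
// Round a Unix timestamp (seconds) down to the start of its aggregation
// bucket, using UTC calendar boundaries.
function bucketStart(tsSeconds, interval) {
    var d = new Date(tsSeconds * 1000);
    switch (interval) {
        case 'hourly':
            return Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate(), d.getUTCHours()) / 1000;
        case 'daily':
            return Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate()) / 1000;
        case 'weekly': // start of the week, Sunday-based
            return Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate()) / 1000 - d.getUTCDay() * 86400;
        case 'monthly':
            return Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), 1) / 1000;
        case 'quarterly':
            return Date.UTC(d.getUTCFullYear(), Math.floor(d.getUTCMonth() / 3) * 3, 1) / 1000;
        case 'yearly':
            return Date.UTC(d.getUTCFullYear(), 0, 1) / 1000;
        default:
            throw new Error('unknown interval: ' + interval);
    }
}
```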


Reports

  • Generate multiple reports: specified by metrics and aggregation level

  • Export data into PDF/Excel available for download in UI


Alerts

  • Allow user to specify some threshold values for metrics that will alert by email

    • High-mark reached for CDAP downloads

    • Low-mark for page views

    • Notifications on weekly and daily changes
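The alert checks above amount to comparing the latest aggregate against configured marks. A sketch of the rule evaluation; the rule shape is an assumption for illustration:

```javascript
// Evaluate one alert rule against the latest value for its metric.
// Returns a message to email, or null if no alert fires.
function evaluateAlert(rule, latestValue) {
    if (rule.highMark !== undefined && latestValue >= rule.highMark) {
        return rule.metric + ' reached high-mark: ' + latestValue;
    }
    if (rule.lowMark !== undefined && latestValue <= rule.lowMark) {
        return rule.metric + ' fell to low-mark: ' + latestValue;
    }
    return null;
}
```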

Caching

  • Possible caching of recent queries for optimization.
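The caching mentioned above could be as simple as a small time-bounded map in front of the query layer; this sketch is illustrative only (the caller supplies the clock so eviction is testable):

```javascript
// Minimal TTL cache for recent query results, keyed by the query string.
function QueryCache(ttlMillis) {
    this.ttl = ttlMillis;
    this.entries = {};
}
QueryCache.prototype.get = function (query, now) {
    var e = this.entries[query];
    return (e && now - e.at < this.ttl) ? e.result : null; // null on miss/expiry
};
QueryCache.prototype.put = function (query, result, now) {
    this.entries[query] = { result: result, at: now };
};
```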


Fault Tolerance

  • Web Beacons

    • Web beacons will invoke the remote script, which will write information to an internal web server; logstash will then forward the events from there

    • If the beacon fails to invoke the remote script (proxy/network issues), we may lose information, but in that case the user did not load the entire webpage anyway

  • Reading from S3

    • We keep track of which files we have read up to

    • May also need to keep track of specific files that have failed, so we can re-read into our system

    • Files are read in batch processing, so fault tolerance is easier to maintain than real-time

  • GitHub

    • Only periodic calls to the Github API are made, no data processing is done

    • If Github servers are unable to be reached, then we just lose a data point at that time, not a big issue

  • Data lake (PartitionedFileSet), Dataset Sinks

    • Main concern is with hardware

    • Data is saved to disk, so should retain data even on failure

    • Need some sort of replication, so data is not lost if server fails

    • Should have enough storage to maintain historical data for about 2 years

    • Needs to be highly available, so it can continuously feed data and the UI can always access information
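The S3 file tracking described above can be sketched as a small ledger of file states, so a later run can skip completed files and retry failed ones. Persistence is out of scope here; this models only the bookkeeping, and the names are illustrative:

```javascript
// Track which S3 files have been fully read and which failed.
function FileTracker() {
    this.state = {}; // filename -> 'done' | 'failed'
}
FileTracker.prototype.markDone = function (name) { this.state[name] = 'done'; };
FileTracker.prototype.markFailed = function (name) { this.state[name] = 'failed'; };
// Unseen and failed files should both be (re-)read.
FileTracker.prototype.shouldRead = function (name) { return this.state[name] !== 'done'; };
FileTracker.prototype.retryList = function () {
    var self = this;
    return Object.keys(this.state).filter(function (n) {
        return self.state[n] === 'failed';
    });
};
```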

Hardware

  • Work with Dev Ops to make sure hardware is configured to support application correctly

    • Replication, HA, memory (see Fault Tolerance - Dataset Sinks above)

    • Need additional hosts for web servers (beacons), redirect servers, proxy/load balancers

      • Determine how many new servers are necessary, if any

Deployment

  • Two options

    • Script that curls configuration files and application JARs to the CDAP framework

      • Pros

        • Can use ETL plugins, simpler to deploy templates as only JSON configs are needed (do not need to duplicate code)

        • Can also automate the starting/stopping of certain components with scripts

      • Cons

        • Portability issues

    • Write a single app that encapsulates all components

      • Pros

        • Easy to deploy as all components will be in a single JAR

      • Cons

        • Cannot use ETL

        • Lots of duplicate code within the app; if components need to be updated or fixed in the future, it will be harder to maintain

        • To start/stop certain components, scripts will still be needed and/or the user will have to do it manually through the UI
