GitHub Statistics

This page documents the plan for redesigning GitHub statistics collection in the Caskalytics app.

Goals for Redesign

The redesign creates a standalone "mini" app that someone can install in their CDAP platform to passively collect GitHub webhook messages and analyze them as needed. The current implementation is limited: it polls the GitHub API periodically and gathers information only about specific repos. The new app instead exposes an endpoint that collects and stores everything GitHub webhooks post to it, including comments, PRs, new issues, new repos, etc. For more information, see https://developer.github.com/webhooks/ .

Ideally, an organization could configure this webhook at the org level to passively capture all changes made to their GitHub account.

Current Implementation of GitHub Metrics

  • API: https://developer.github.com/v3/
  • A Workflow Custom Action makes periodic RESTful calls to the GitHub API.
  • Results are written into the GitHub partition of the Fileset.
  • A MapReduce job periodically reads from the GitHub partition of the Fileset and updates the Cube dataset.

Use Cases

  • As a user of Caskalytics, I would like to store and retrieve all activity associated with my GitHub organization.
  • As a user of Caskalytics, I would like to view metrics for my GitHub repositories including forks, pull requests, watchers, stargazers and open issues.
  • As a user of Caskalytics, I would like to view metrics about the members of my organization, such as the number of issues opened and the number of pull requests created.
  • As a user of Caskalytics, I would like to view a histogram of metrics about my repositories.

New Implementation of GitHub Metrics

  • Expose a service that accepts webhook messages from GitHub, verifies them, and writes the valid ones to a Datatable.
    • This will store both the raw messages and a metrics table for collecting stats at the repo and user level
  • Expose a RESTful endpoint to query metrics from the aggregates table and return results in JSON
  • Use the data service to create some sort of visual display of the information.

Additional Value

  • Because we are collecting each message as it happens in GitHub, we can have a more real-time overview of what's happening in our Org (data updates faster)
  • Setting this at the Org level means we don't have to configure it for each repo. As new repos are added, their data will start showing up in our metrics (decreased maintenance)
  • Since we have the raw messages stored, we can reprocess them at any time to extract additional metrics that we may not know we need yet (future-proof)
  • No additional MapReduce job is needed since we are collecting metrics as they happen (simpler)

Metrics Calculated

  • Metrics will be stored in a separate dataset from the raw messages
  • Per-repo metrics will be overwritten each time a new message is received from GitHub
  • All Time Metrics
    • Per Repository
      • repo size
      • stargazers_count
      • watchers_count
      • forks_count
      • total pull requests
    • Per Message
      • count
    • Per Repo / Per Message
      • count
    • Per Sender
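The all-time metrics above could be maintained roughly as follows. This is only a sketch: the in-memory dictionaries stand in for the metrics dataset, `record_metrics` is a hypothetical name, and counting per sender by message type is an assumption taken from the sample response of the sender stats endpoint. The field names (`full_name`, `size`, `stargazers_count`, `watchers_count`, `forks_count`, `sender.login`) are the ones GitHub uses in webhook payloads. Total pull requests is left out because it cannot be read directly from a single message.

```python
from collections import Counter

# Fields copied from the payload's "repository" object on every message.
REPO_FIELDS = ("size", "stargazers_count", "watchers_count", "forks_count")

# In-memory stand-ins for the metrics dataset (assumption for illustration).
repo_stats = {}                  # full repo name -> latest snapshot
message_counts = Counter()       # message type -> count
repo_message_counts = Counter()  # (repo, message type) -> count
sender_counts = Counter()        # (sender, message type) -> count


def record_metrics(message_type, payload):
    """Update the all-time metrics for one parsed webhook message."""
    message_counts[message_type] += 1

    repo = payload.get("repository") or {}
    repo_name = repo.get("full_name")
    if repo_name:
        # Overwrite: the newest message carries the current repo values.
        repo_stats[repo_name] = {f: repo.get(f, 0) for f in REPO_FIELDS}
        repo_message_counts[(repo_name, message_type)] += 1

    sender = (payload.get("sender") or {}).get("login")
    if sender:
        sender_counts[(sender, message_type)] += 1
```

Messages without a repository or sender (if any arrive) still count toward the per-message totals but leave the other tables untouched.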

Capture Endpoint

  • The capture endpoint will be a catch-all endpoint that accepts POST messages from GitHub, verifies their authenticity, and writes the message to the data store.
  • Each message should have the following headers and payload to be considered "valid"
    • User-Agent should start with GitHub-Hookshot/<id>
    • X-GitHub-Delivery should be a UUID for the message
    • X-GitHub-Event should be the name of the event (the message type)
    • X-Hub-Signature should contain an HMAC SHA-1 digest of the message (prefixed with sha1=) for verification
    • payload should be the JSON message
  • If any required headers are missing or invalid, the response will be UNAUTHORIZED with a message stating that the caller is not authorized to call the service.
  • If the Event is missing, a BAD_REQUEST is returned.
  • If there is no payload, a BAD_REQUEST is returned.
  • If the payload digest does not match the one provided in the Signature header, or there is an error generating it, a BAD_REQUEST is returned.
  • When everything is successful, an OK is returned with a message that it was successfully processed.
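The validation rules above can be sketched as a single function. This is a minimal illustration, not the service code: `verify_webhook` is a hypothetical name, headers are passed as a plain case-sensitive dict, the payload is assumed to be the raw request body (JSON content type), and the order in which the checks run is one possible reading of the rules. The signature check itself follows GitHub's scheme, an HMAC SHA-1 hex digest of the body keyed with the webhook secret.

```python
import hashlib
import hmac
from http import HTTPStatus


def verify_webhook(headers, body, secret):
    """Return the status the capture endpoint would answer with.

    headers: dict of request headers
    body:    raw request body as bytes
    secret:  shared webhook secret as bytes
    """
    signature = headers.get("X-Hub-Signature", "")
    if (not headers.get("User-Agent", "").startswith("GitHub-Hookshot/")
            or not headers.get("X-GitHub-Delivery")
            or not signature.startswith("sha1=")):
        return HTTPStatus.UNAUTHORIZED

    if not headers.get("X-GitHub-Event") or not body:
        return HTTPStatus.BAD_REQUEST

    expected = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()
    # compare_digest takes the same time whether or not the values match.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return HTTPStatus.BAD_REQUEST

    return HTTPStatus.OK
```

Only a message that yields OK would then be written to the data store.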

Metrics Endpoints

  • These will be REST endpoints used to get repo stats for Caskalytics
    • GET /{org}/{repo}/stats
      Description: Returns the stats of the given repo
      Parameters:
        Name    Description                      Required?
        org     String - the org for the repo    Yes
        repo    String - the name of the repo    Yes
      Response:
        {
          "name": "russorat/savage-leads",
          "size": 481,
          "forks": 0,
          "watchers": 1,
          "stargazers": 1,
          "openIssues": 3,
          "totalPullRequests": 2
        }
    • GET /{org}/{repo}/messages/{messageType}
      Description: Returns the messages for a given repo. A list of events can be found here: https://developer.github.com/webhooks/#events
      Parameters:
        Name           Description                               Required    Default
        org            String - the org for the repo             Yes
        repo           String - the name of the repo             Yes
        messageType    String - the type of message to return    Yes
        startTime      start time to search for, in seconds      No          0
        endTime        end time to search for, in seconds        No          now
      Response:
        {
          "totalMessages": 2,
          "messages": ["{...}","{...}"]
        }
    • GET /{sender}/stats
      Description: Returns statistics for a given GitHub user (sender). If no sender is found, an empty stats list is returned.
      Parameters:
        Name      Description                                     Required    Default
        sender    String - The GitHub username to get stats for   Yes
      Response:
        {
          "sender": "russorat",
          "stats": {
            "issue_comment": 1,
            "issues": 3,
            "create": 1,
            "ping": 1,
            "push": 1
          }
        }
      Response when no sender is found:
        {
          "sender": "russoratsdfsdf",
          "stats": {}
        }
    • GET /topSenders/{messageType}?limit={limit}
      Description: Returns an array of the top senders for the given message type
      Parameters:
        Name           Description                                               Required    Default
        messageType    String - The type of message to get the top senders for   Yes
        limit          long - The number of results to return                    No          10
      Response:
        [
          {
            "sender": "russorat",
            "stats": {
              "push": 1
            }
          }
        ]
    • GET /{org}/{repo}/metric?metric={metric}
      Description: Returns a given custom metric for a repo
      Parameters:
        Name      Description                            Required    Default
        org       String - the org for the repo          Yes
        repo      String - the name of the repo          Yes
        metric    String - the custom metric to return   Yes
      Response:
        {
          "repoName": "russorat/savage-leads",
          "metricName": "repository.watchers",
          "metric": 0
        }
    • GET /{messageId}
      Description: Returns the raw message given a message id
      Parameters:
        Name         Description                                                                           Required    Default
        messageId    String - the GitHub message id to return. Can be found using the messages endpoint   Yes
      Response:
        {
          "ref": "refs/heads/testbranch",
          "before": "0000000000000000000000000000000000000000",
          "after": "6d6db4855be89fb10f5b09a214a20b6125cd7be8",
          "created": true,
          "deleted": false,
          "forced": true,
          "base_ref": "refs/heads/master",
          "compare": "https://github.com/russorat/savage-leads/compare/testbranch",
          "commits": [],
          ...
        }
    • GET /{org}/{repo}/messages/{messageType}?startTime={startTime}&endTime={endTime}&limit={limit}&offset={offset}
      Description: Returns a list of message Ids for the given repo and message type
      Parameters:
        Name           Description                                                                   Required    Default
        org            String - the org for the repo                                                 Yes
        repo           String - the name of the repo                                                 Yes
        messageType    String - the type of message to search (push, issue, pull_request, etc.)     Yes
        startTime      long - the start time as a unix timestamp in seconds                          No          0
        endTime        long - the end time as a unix timestamp in seconds                            No          Now
        limit          int - the number of results to return                                         No          10
        offset         int - the offset used for paging                                              No          0
      Response:
        {
          "totalMessages": 1,
          "messageIds": [
            "132e1700-efa8-11e5-844f-7d105e7a1526"
          ]
        }
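As an illustration of how a client such as the Caskalytics front end might call the paged message id listing, the sketch below builds the request URL with the documented defaults. The function name `messages_url` and the base URL in the usage note are hypothetical; the path and query parameter names come from the endpoint description above.

```python
import time
from urllib.parse import quote, urlencode


def messages_url(base_url, org, repo, message_type,
                 start_time=0, end_time=None, limit=10, offset=0):
    """Build the URL that lists message ids for one repo and message type."""
    if end_time is None:
        end_time = int(time.time())  # documented default: now

    # Percent-encode each path segment so unusual names cannot break the path.
    parts = (org, repo, "messages", message_type)
    path = "/".join(quote(part, safe="") for part in parts)

    query = urlencode({
        "startTime": start_time,
        "endTime": end_time,
        "limit": limit,
        "offset": offset,
    })
    return f"{base_url.rstrip('/')}/{path}?{query}"
```

For example, `messages_url("http://example.invalid/github", "russorat", "savage-leads", "push", limit=10, offset=10)` would address the second page of ten push message ids. Each id can then be passed to the raw message endpoint.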

 

GitHub Raw Dataset

  • Dataset to store the raw messages captured from GitHub
  • Key is the X-GitHub-Delivery header of the message
  • The table has three columns, one for the messageId (String), one for the messageType (String), and one for the jsonPayload (String)
  • This table is RecordScannable so the data can be viewed in the UI.

GitHub Parsed Dataset

  • Dataset will contain a Table to hold the parsed messages.
  • The JSON message is first flattened and then each value inserted as a column in the Table. Additional columns for eventId and messageType are also added. 
  • The key to the table will be <fullRepoName>-<messageType>-<inverseTimestampInSeconds>-<X-GitHub-Delivery>. This will allow scanning by message and by time with the most recent messages returned first.
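The flattening step and the row key can be sketched as below. Several details are assumptions, since the page does not fix them: nested keys are joined with dots, list elements are addressed by index, and the inverse timestamp is computed as the maximum signed 64-bit value minus the timestamp, zero-padded so that string order matches numeric order. The names `flatten` and `row_key` are hypothetical.

```python
MAX_LONG = 2**63 - 1  # assumed ceiling for the inverse timestamp


def flatten(value, prefix=""):
    """Flatten nested JSON into {dotted.path: leaf value} pairs."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}

    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def row_key(full_repo_name, message_type, timestamp_seconds, delivery_id):
    """Build <fullRepoName>-<messageType>-<inverseTimestamp>-<deliveryId>."""
    inverse = MAX_LONG - timestamp_seconds
    # Fixed width keeps lexicographic order equal to numeric order.
    return f"{full_repo_name}-{message_type}-{inverse:019d}-{delivery_id}"
```

Because a later timestamp gives a smaller inverse value, a scan over the prefix `<fullRepoName>-<messageType>-` returns the newest messages first, which is the ordering the key design aims for.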

GitHub Metrics

  • Data is stored in a Cube dataset
  • The Cube will have the following properties
    • Resolutions (in seconds): 60, 3600, 86400, 604800 (1 minute, 1 hour, 1 day, 1 week)
    • Dimensions: 
      • repository
      • message_type
      • repository, message_type
      • sender
      • sender, message_type
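To show what these properties mean for a single message, the sketch below rolls one event into every resolution and every dimension combination listed above. It is an in-memory stand-in and does not use the CDAP Cube API; `add_fact` and the key layout are hypothetical.

```python
from collections import Counter

RESOLUTIONS = (60, 3600, 86400, 604800)  # seconds
AGGREGATIONS = (
    ("repository",),
    ("message_type",),
    ("repository", "message_type"),
    ("sender",),
    ("sender", "message_type"),
)

# (resolution, bucket start, dimension values) -> count
cube = Counter()


def add_fact(timestamp, dimensions, value=1):
    """Add one measurement to every matching aggregation and resolution."""
    for resolution in RESOLUTIONS:
        bucket = timestamp - timestamp % resolution  # start of the time bucket
        for names in AGGREGATIONS:
            # Skip aggregations whose dimensions this message does not carry.
            if all(name in dimensions for name in names):
                values = tuple((name, dimensions[name]) for name in names)
                cube[(resolution, bucket, values)] += value
```

A push by one sender to one repo therefore touches 4 x 5 = 20 counters, which is the cost of being able to answer each query from a pre-aggregated row instead of scanning raw messages.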

Integrating with Caskalytics

  • Caskalytics code will need to be updated to call the new repo stats endpoints as needed
  • From what I can tell, we would only need to update the front end to query these endpoints. The backend logic used to query data from GitHub can be removed.