This page documents the plan for redesigning GitHub statistics collection in the Caskalytics app.
Goals for Redesign
The redesign creates a standalone "mini" app that can be installed on a CDAP platform to passively collect GitHub webhook messages and analyze them as needed. The current implementation is limited because it periodically polls the GitHub API and gathers information only about specific repos. The redesigned app instead exposes an endpoint that collects and stores everything GitHub webhooks post to it, including comments, PRs, new issues, and new repos. For more information, see https://developer.github.com/webhooks/ .
Ideally, an organization could configure this webhook at the org level to passively capture all changes made to their GitHub account.
Current Implementation of GitHub Metrics
- A Workflow Custom Action runs periodic RESTful calls to the GitHub API.
- Results are written into the GitHub partition of the Fileset.
- A MapReduce job periodically reads from the GitHub partition of the Fileset and updates the Cube dataset.
New Implementation of GitHub Metrics
- Expose a service that accepts and verifies webhook messages from GitHub and writes those messages to a Datatable.
- The service will store both the raw messages and a metrics table that aggregates stats at the repo and user level.
- Expose a RESTful endpoint that queries metrics from the aggregates table and returns the results as JSON.
- Use the data service to build a visual display of the information.
Metrics Calculated
- Metrics will be stored in a separate dataset from the raw messages.
- Repo-level values will be overwritten each time a new message is received from GitHub.
- All Time Metrics
  - Per Repository
    - repo size
    - stargazers_count
    - watchers_count
    - forks_count
  - Per Message
    - count
  - Per Repo / Per Message
    - count
- Per Repository
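The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not the app's implementation: `MetricsStore` is a hypothetical in-memory stand-in for the metrics dataset, and the payload field names (`repository`, `full_name`, `size`) are assumptions about the shape of the webhook JSON.

```python
from collections import Counter

# Repository fields tracked as all-time, per-repository metrics.
# "size" is assumed to be the payload field behind "repo size".
REPO_FIELDS = ("size", "stargazers_count", "watchers_count", "forks_count")


class MetricsStore:
    """Hypothetical in-memory stand-in for the metrics dataset."""

    def __init__(self):
        self.repo_stats = {}               # repo name -> latest stats
        self.per_message = Counter()       # message type -> count
        self.per_repo_message = Counter()  # (repo, message type) -> count

    def record(self, message_type, payload):
        """Update all metrics for one received webhook message."""
        self.per_message[message_type] += 1
        repo = payload.get("repository") or {}
        name = repo.get("full_name")
        if name is None:
            return  # assumed: some messages carry no repository block
        self.per_repo_message[(name, message_type)] += 1
        # Overwrite rather than accumulate: the latest message wins.
        self.repo_stats[name] = {f: repo[f] for f in REPO_FIELDS if f in repo}
```

Counts only ever grow, while the repository stats are replaced wholesale by whatever the most recent message reports.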
Capture Endpoint
- The capture endpoint will be a catch-all endpoint that accepts POST messages from GitHub, verifies their authenticity, and writes each message to the data store.
- Each message should have the following headers to be considered "valid":
  - User-Agent should start with GitHub-Hookshot/<id>
  - X-GitHub-Delivery should be a UUID for the message
  - X-GitHub-Event should be the name of the message
  - X-Hub-Signature should contain a SHA-1 digest of the message for verification (GitHub sends this as an HMAC hex digest keyed with the webhook secret and prefixed with sha1=)
  - payload should be the JSON message
- If any required header is missing or invalid, the response will be UNAUTHORIZED with a message stating that the caller is not authorized to call the service.
- If the Event is missing, a BAD_REQUEST is returned.
- If there is no payload, a BAD_REQUEST is returned.
- If the payload digest does not match the one provided in the Signature header, or there is an error generating it, a BAD_REQUEST is returned.
- When everything is successful, an OK is returned with a message that the payload was successfully processed.
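The validation rules above could be sketched as follows. This is one possible reading of the rules, not the service's actual code. `verify_webhook` is a hypothetical helper, the headers are assumed to arrive as a plain dict with exact-case keys, and User-Agent, X-GitHub-Delivery, and X-Hub-Signature are assumed to be the "required headers" that produce UNAUTHORIZED.

```python
import hashlib
import hmac
import uuid


def verify_webhook(headers, body, secret):
    """Return (status_code, message) for one webhook POST.

    headers: dict of header name -> value (exact-case keys assumed)
    body, secret: bytes
    """
    agent = headers.get("User-Agent", "")
    delivery = headers.get("X-GitHub-Delivery", "")
    signature = headers.get("X-Hub-Signature", "")

    # Missing or invalid required headers -> UNAUTHORIZED.
    if not agent.startswith("GitHub-Hookshot/") or not signature:
        return 401, "not authorized to call this service"
    try:
        uuid.UUID(delivery)
    except ValueError:
        return 401, "not authorized to call this service"

    # Missing event name or payload -> BAD_REQUEST.
    if not headers.get("X-GitHub-Event"):
        return 400, "missing event"
    if not body:
        return 400, "missing payload"

    # Recompute the HMAC-SHA1 digest and compare in constant time.
    expected = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 400, "payload digest does not match signature"

    return 200, "message successfully processed"
```

Using `hmac.compare_digest` instead of `==` avoids leaking how many leading characters of the signature matched.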
Metrics Endpoints
- These will be REST endpoints used to get repo stats for Caskalytics
| Endpoint | Description | Parameter | Parameter description | Required? |
| --- | --- | --- | --- | --- |
| /{org}/{repo}/stats | Returns the stats of the given repo | org | String - the org for the repo | Yes |
| | | repo | String - the name of the repo | Yes |
| /{org}/{repo}/messages/{messageType} | Returns the messages for a given repo | org | String - the org for the repo | Yes |
| | | repo | String - the name of the repo | Yes |
| | | messageType | String - the type of message to return | Yes |
| | | startTime | Start time to search for, in seconds. Defaults to 0 | No |
| | | endTime | End time to search for, in seconds. Defaults to now | No |
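As a rough illustration of the optional time parameters, the defaults could be resolved like this. `resolve_time_range` is a hypothetical helper; the page does not say how the parameters are passed, so a dict of string values is assumed here, and the range check is an added assumption rather than documented behavior.

```python
import time


def resolve_time_range(params, now=None):
    """Return (start, end) in epoch seconds from optional parameters."""
    if now is None:
        now = int(time.time())
    start = int(params.get("startTime", 0))   # defaults to 0
    end = int(params.get("endTime", now))     # defaults to now
    # Assumption: reject negative or inverted ranges.
    if start < 0 or end < start:
        raise ValueError("expected 0 <= startTime <= endTime")
    return start, end
```

A request without either parameter therefore covers every message stored so far.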
GitHub Dataset
- The dataset will contain two stores: a Table to hold the raw messages and a Cube to hold the metrics.
- As raw data is written to the Table, the metrics in the Cube will be updated as needed.
- The JSON message is first flattened, and each value is then inserted as a column in the Table. A final field called rawPayload captures the full payload.
- The Table key will be <messageType>-<timestampInSeconds>-<X-GitHub-Delivery>. This allows scanning by message type and by time.
- The Cube will have the following properties:
  - Resolutions (in seconds): 60, 3600, 86400, 604800 (1 minute, 1 hour, 1 day, 1 week)
  - Dimensions:
    - repository
    - message_type
    - repository, message_type
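The Table layout described above (flattened columns, a rawPayload column, and a composite row key) might look like the following sketch. The function names are hypothetical, the dotted column paths are an assumed flattening scheme, and zero-padding the timestamp to ten digits is an added assumption so that lexicographic key order matches time order.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted.path: scalar} columns."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    columns = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        columns.update(flatten(value, path))
    return columns


def row_key(message_type, timestamp_seconds, delivery_id):
    """Build <messageType>-<timestampInSeconds>-<X-GitHub-Delivery>."""
    # Assumption: pad the timestamp so string order equals numeric order.
    return f"{message_type}-{timestamp_seconds:010d}-{delivery_id}"


def build_row(message_type, timestamp_seconds, delivery_id, raw_payload):
    """Return (key, columns) for one raw webhook message."""
    columns = flatten(json.loads(raw_payload))
    columns["rawPayload"] = raw_payload  # keep the full payload as well
    return row_key(message_type, timestamp_seconds, delivery_id), columns


def scan_range(message_type, start_seconds, end_seconds):
    """Return (start_key, stop_key) covering one type and time window.

    The stop key is exclusive, so it uses end_seconds + 1.
    """
    return (f"{message_type}-{start_seconds:010d}",
            f"{message_type}-{end_seconds + 1:010d}")
```

Because the message type leads the key, a scan over one type and time window touches a single contiguous key range.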