This page documents the plan for redesigning Github statistics collection in the Caskalytics app.
Current Implementation of Github Metrics
- Use a Workflow Custom Action to run periodic RESTful calls to the Github API
- Results will be written into the GitHub partition of the Fileset.
- A MapReduce job will periodically read from the GitHub partition of the Fileset, and update the Cube dataset.
Proposed Implementation of Github Metrics
- Expose a service that accepts webhook messages from Github, verifies that they are valid, and writes them to a Datatable.
- The service will store both the raw messages and a metrics table that aggregates stats at the repo and user level
- Expose a RESTful endpoint to query metrics from the aggregates table and return results in JSON
- Use the data service to drive a visual display of the metrics.
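The verification step in the list above could look roughly like the following. This is a minimal sketch, assuming the webhook is configured with a shared secret and that the service checks the `X-Hub-Signature-256` header GitHub sends with each delivery; the function name `is_valid_signature` is hypothetical and not part of Caskalytics.

```python
import hashlib
import hmac


def is_valid_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a GitHub webhook signature against the raw request body.

    `signature_header` is the value of X-Hub-Signature-256, which has the
    form "sha256=<hex digest>". The helper name is hypothetical.
    """
    prefix = "sha256="
    if not signature_header or not signature_header.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = signature_header[len(prefix):]
    # compare_digest runs in constant time, so the comparison does not
    # reveal how many leading characters matched.
    return hmac.compare_digest(expected.encode(), received.encode())
```

The signature must be computed over the raw bytes of the request body, before any JSON parsing, otherwise re-serialization differences will make valid messages fail.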
Additional Value
- Because we are collecting each message as it happens in Github, we can have a more real-time overview of what's happening in our Org (data updates faster)
- Setting this at the Org level means we don't have to configure it for each repo. As new repos are added, their data will start showing up in our metrics (decreased maintenance)
- Since we have the raw messages stored, we can reprocess them at any time to extract additional metrics that we may not know we need yet (future-proof)
- No additional MapReduce job is needed, since metrics are collected as messages arrive (simpler)
Metrics Calculated
- Metrics will be stored in a separate dataset from the raw messages
- Repo-level metrics will be overwritten each time a new message is received from Github
- All Time Metrics
  - Per Repository
    - repo size
    - stargazers_count
    - watchers_count
    - forks_count
    - total pull requests
  - Per Message
    - count
  - Per Repo / Per Message
    - count
  - Per Sender
- Per Repository
- These will be REST endpoints used to get repo stats for Caskalytics
- Dataset will contain two stores: a Table to hold the raw messages and a Cube to hold the metrics.
- As the raw data is written to the Table store, the metrics in the Cube will be updated as needed
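As a rough illustration of "write the raw message, then update the metrics", the sketch below stores each message and overwrites the per-repository values with whatever the latest payload reports. It assumes the payload carries GitHub's usual `repository` object (`full_name`, `size`, `forks_count`, `watchers_count`, `stargazers_count`, `open_issues_count`); plain dicts stand in for the Table and the metrics store, and `handle_message` is a hypothetical name. Total pull requests are not part of that object, so they are left out here.

```python
import json

# Payload field -> name used in the repo stats response.
REPO_FIELDS = {
    "size": "size",
    "forks_count": "forks",
    "watchers_count": "watchers",
    "stargazers_count": "stargazers",
    "open_issues_count": "openIssues",
}


def handle_message(raw_table, repo_stats, delivery_id, message_type, payload_json):
    """Store the raw message, then overwrite the repo-level stats.

    raw_table and repo_stats are plain dicts standing in for the real
    datasets (an assumption made to keep the sketch self-contained).
    """
    raw_table[delivery_id] = {
        "messageType": message_type,
        "jsonPayload": payload_json,
    }
    repo = json.loads(payload_json).get("repository")
    if not repo or "full_name" not in repo:
        return  # some messages may not describe a repository
    stats = {"name": repo["full_name"]}
    for source, target in REPO_FIELDS.items():
        if source in repo:
            stats[target] = repo[source]
    # Latest message wins: the previous values are replaced, not summed.
    repo_stats[repo["full_name"]] = stats
```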
GET /{org}/{repo}/stats
Returns the stats of the given repo.
Parameters:
- org (String, required): the org for the repo
- repo (String, required): the name of the repo
Response:
    {
      "name": "russorat/savage-leads",
      "size": 481,
      "forks": 0,
      "watchers": 1,
      "stargazers": 1,
      "openIssues": 3,
      "totalPullRequests": 2
    }
GET /{org}/{repo}/messages/{messageType}
Returns the messages for a given repo. A list of events can be found here:
Parameters:
- org (String, required): the org for the repo
- repo (String, required): the name of the repo
- messageType (String, required): the type of message to return
- startTime (optional, default 0): start time to search from, in seconds
- endTime (optional, default now): end time to search to, in seconds
Response:
    {
      "totalMessages": 2,
      "messages": ["{...}", "{...}"]
    }
GET /{sender}/stats
Returns statistics for a given github user (sender). If no sender is found, an empty stats list is returned.
Parameters:
- sender (String, required): the github username to get stats for
Response:
    {
      "sender": "russorat",
      "stats": {
        "issue_comment": 1,
        "issues": 3,
        "create": 1,
        "ping": 1,
        "push": 1
      }
    }
Response when the sender is not found:
    {
      "sender": "russoratsdfsdf",
      "stats": {}
    }
GET /topSenders/{messageType}?limit={limit}
Returns an array of the top senders for the given message type.
Parameters:
- messageType (String, required): the type of message to get the top senders for
- limit (long, optional, default 10): the number of results to return
Response:
    [
      {
        "sender": "russorat",
        "stats": {
          "push": 1
        }
      }
    ]
GET /{org}/{repo}/metric?metric={metric}
Returns a given custom metric for a repo.
Parameters:
- org (String, required): the org for the repo
- repo (String, required): the name of the repo
- metric (String, required): the custom metric to return
Github Dataset
    {
      "repoName": "russorat/savage-leads",
      "metricName": "repository.watchers",
      "metric": 0
    }
GET /{messageId}
Returns the raw message given a message id.
Parameters:
- messageId (String, required): the Github message id to return. Can be found using the messages endpoint
Response:
    {
      "ref": "refs/heads/testbranch",
      "before": "0000000000000000000000000000000000000000",
      "after": "6d6db4855be89fb10f5b09a214a20b6125cd7be8",
      "created": true,
      "deleted": false,
      "forced": true,
      "base_ref": "refs/heads/master",
      "compare": "",
      "commits": [],
      ...
    }
GET /{org}/{repo}/messages/{messageType}?startTime={startTime}&endTime={endTime}&limit={limit}&offset={offset}
Returns a list of message Ids for the given repo and message type.
Parameters:
- org (String, required): the org for the repo
- repo (String, required): the name of the repo
- messageType (String, required): the type of message to search (push, issue, pull_request, etc.)
- startTime (long, optional, default 0): the start time as a unix timestamp in seconds
- endTime (long, optional, default now): the end time as a unix timestamp in seconds
- limit (int, optional, default 10): the number of results to return
- offset (int, optional, default 0): the offset used for paging
Response:
    {
      "totalMessages": 1,
      "messageIds": [
        "132e1700-efa8-11e5-844f-7d105e7a1526"
      ]
    }
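For callers of the paged message-id endpoint, a small helper can keep the query parameters and their defaults in one place. This is only a sketch: `messages_path` is a hypothetical name, it returns a path relative to wherever the service is mounted (the base URL is not specified here), and it omits `endTime` when the caller wants the server-side default of "now".

```python
from typing import Optional
from urllib.parse import quote, urlencode


def messages_path(org: str, repo: str, message_type: str,
                  start_time: int = 0, end_time: Optional[int] = None,
                  limit: int = 10, offset: int = 0) -> str:
    """Build the relative path for the paged message-id endpoint."""
    params = [("startTime", start_time)]
    if end_time is not None:
        # Leaving endTime out lets the service apply its default (now).
        params.append(("endTime", end_time))
    params += [("limit", limit), ("offset", offset)]
    path = "/{}/{}/messages/{}".format(
        quote(org, safe=""), quote(repo, safe=""), quote(message_type, safe=""))
    return path + "?" + urlencode(params)


# Second page of five push messages:
# messages_path("russorat", "savage-leads", "push", limit=5, offset=5)
```

Paging then amounts to repeating the call with `offset` increased by `limit` each time.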
Github Raw Dataset
- Dataset to store the raw messages captured from Github
- Key is the X-GitHub-Delivery header of the message
- The table has three columns, one for the messageId (String), one for the messageType (String), and one for the jsonPayload (String)
- This table is RecordScannable so the data can be viewed in the UI.
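The raw row described above can be sketched as follows. The key comes from the `X-GitHub-Delivery` header as stated; taking `messageType` from the `X-GitHub-Event` header is an assumption of this sketch, and `RawMessage` and `to_raw_row` are hypothetical names.

```python
from typing import Mapping, NamedTuple, Tuple


class RawMessage(NamedTuple):
    """The three String columns of the raw table."""
    message_id: str
    message_type: str
    json_payload: str


def to_raw_row(headers: Mapping[str, str], body: str) -> Tuple[str, RawMessage]:
    """Return (row key, columns) for one webhook delivery.

    Raises KeyError if either GitHub header is missing.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    message_id = lowered["x-github-delivery"]
    message_type = lowered["x-github-event"]  # assumed source of messageType
    return message_id, RawMessage(message_id, message_type, body)
```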
Github Parsed Dataset
- Dataset will contain a Table to hold the parsed messages.
- The JSON message is first flattened, and then each value is inserted as a column in the Table. A final field called rawPayload is also written to capture the full payload. Additional columns for eventId and messageType are also added.
- The key to the table will be <fullRepoName>-<messageType>-<timestampInSeconds><inverseTimestampInSeconds>-<X-GitHub-Delivery>. This will allow scanning by message and by time with the most recent messages returned first.
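The flattening and key construction described above might be sketched like this. Several details are assumptions of the sketch rather than the actual design: nested names are joined with dots (mirroring the `repository.watchers` metric name), the time component of the key uses only the inverse timestamp computed as Java's `Long.MAX_VALUE` minus the timestamp, and `eventId` is filled from the `X-GitHub-Delivery` value. All function names are hypothetical.

```python
import json

LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {"a.b.0": leaf} pairs.

    Empty dicts and lists produce no columns.
    """
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, name))
    return flat


def row_key(full_repo_name, message_type, timestamp_seconds, delivery_id):
    """Build a key that sorts the most recent message first within a prefix."""
    inverse = LONG_MAX - timestamp_seconds
    # Fixed-width digits keep lexicographic order equal to numeric order.
    return f"{full_repo_name}-{message_type}-{inverse:019d}-{delivery_id}"


def to_parsed_row(full_repo_name, message_type, timestamp_seconds,
                  delivery_id, payload_json):
    """Return (row key, columns) for the parsed table."""
    columns = flatten(json.loads(payload_json))
    columns["rawPayload"] = payload_json
    columns["eventId"] = delivery_id  # assumption: eventId is the delivery id
    columns["messageType"] = message_type
    return row_key(full_repo_name, message_type, timestamp_seconds, delivery_id), columns
```

Because newer timestamps give smaller inverse values, a forward scan over the `<fullRepoName>-<messageType>-` prefix returns the newest messages first without a reverse scan.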
Github Metrics
- Data is stored in a Cube dataset
- The Cube will have the following properties:
  - Resolutions (in seconds): 60, 3600, 86400, 604800 (minute, hour, day, week)
  - Dimensions:
    - repository
    - message_type
    - repository, message_type
    - sender
    - sender, message_type
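To make the resolution and dimension settings concrete, here is a small in-memory stand-in for the Cube. It is not the CDAP Cube API, only a sketch of the idea under stated assumptions: each message increments one counter per resolution bucket and per dimension grouping, and queries sum the buckets in a time range. The names `record` and `query` are hypothetical.

```python
from collections import Counter

RESOLUTIONS = (60, 3600, 86400, 604800)  # seconds
AGGREGATIONS = (
    ("repository",),
    ("message_type",),
    ("repository", "message_type"),
    ("sender",),
    ("sender", "message_type"),
)


def record(cube: Counter, timestamp: int, dimensions: dict, count: int = 1) -> None:
    """Add one fact to every resolution bucket and dimension grouping."""
    for resolution in RESOLUTIONS:
        bucket = timestamp - timestamp % resolution  # start of the bucket
        for names in AGGREGATIONS:
            if all(name in dimensions for name in names):
                group = tuple((name, dimensions[name]) for name in names)
                cube[(resolution, bucket, group)] += count


def query(cube: Counter, resolution: int, start: int, end: int, dimensions: dict) -> int:
    """Sum counts for one grouping over bucket start times in [start, end]."""
    names = next((a for a in AGGREGATIONS if set(a) == set(dimensions)), None)
    if names is None:
        raise ValueError("no aggregation defined for these dimensions")
    group = tuple((name, dimensions[name]) for name in names)
    return sum(value for (res, bucket, grp), value in cube.items()
               if res == resolution and grp == group and start <= bucket <= end)
```

Pre-aggregating at write time is what removes the need for a separate batch job: a question such as "pushes by this sender in the last week" becomes a lookup over a handful of buckets.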