This page documents the plan for redesigning GitHub statistics collection in the Caskalytics app.
...
Current Implementation of GitHub Metrics
A Workflow Custom Action runs periodic RESTful calls to the GitHub API.
Results will be written into the GitHub partition of the Fileset.
A MapReduce job will periodically read from the GitHub partition of the Fileset, and update the Cube dataset.
...
- These will be REST endpoints used to get repo stats for Caskalytics
- Dataset will contain two stores: a Table to hold the raw messages and a Cube to hold the metrics.
- As the raw data is written to the Table store, the metrics in the Cube will be updated as needed
GET /{org}/{repo}/stats
Returns the stats of the given repo.
Parameters:
- org: String - the org for the repo. Required.
- repo: String - the name of the repo. Required.
Response:
{ "name": "russorat/savage-leads", "size": 481, "forks": 0, "watchers": 1, "stargazers": 1, "openIssues": 3, "totalPullRequests": 2 }
GET /{org}/{repo}/messages/{messageType}
Returns the messages for a given repo. A list of events can be found here: https://developer.github.com/webhooks/#events
Parameters:
- org: String - the org for the repo. Required.
- repo: String - the name of the repo. Required.
- messageType: String - the type of message to return. Required.
- startTime: the start time to search from, in seconds. Optional, default 0.
- endTime: the end time to search to, in seconds. Optional, default now.
Response:
{ "totalMessages": 2, "messages": ["{...}","{...}"] }
GET /{sender}/stats
Returns statistics for a given GitHub user (sender). If no sender is found, an empty stats list is returned.
Parameters:
- sender: String - the GitHub username to get stats for. Required.
Response:
{ "sender": "russorat", "stats": { "issue_comment": 1, "issues": 3, "create": 1, "ping": 1, "push": 1 } }
Response when the sender is not found:
{ "sender": "russoratsdfsdf", "stats": {} }
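The per-sender response above can be thought of as a count of messages grouped by type. The following is a minimal sketch of that aggregation, not the app's actual implementation; the `messages` input (an iterable of `(sender, message_type)` pairs) and the function name are hypothetical.

```python
from collections import Counter


def sender_stats(messages, sender):
    """Count messages per message type for one sender.

    `messages` is a hypothetical iterable of (sender, message_type) pairs.
    An unknown sender yields an empty stats object, matching the endpoint.
    """
    counts = Counter(mtype for who, mtype in messages if who == sender)
    return {"sender": sender, "stats": dict(counts)}
```

For an unknown sender the `Counter` stays empty, so the result has the same shape as the second response shown above.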
GET /topSenders/{messageType}?limit={limit}
Returns an array of the top senders for the given message type.
Parameters:
- messageType: String - the type of message to get the top senders for. Required.
- limit: long - the number of results to return. Optional, default 10.
Response:
[ { "sender": "russorat", "stats": { "push": 1 } } ]
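One way to picture the top-senders query is as a ranked count over a single message type. This sketch assumes the same hypothetical `(sender, message_type)` pairs as input and assumes that "top" means "most messages of that type"; the source does not define the ranking.

```python
from collections import Counter


def top_senders(messages, message_type, limit=10):
    """Return the senders with the most messages of `message_type`.

    `messages` is a hypothetical iterable of (sender, message_type) pairs.
    Ranking by message count is an assumption, not confirmed by the design.
    """
    counts = Counter(who for who, mtype in messages if mtype == message_type)
    return [
        {"sender": who, "stats": {message_type: n}}
        for who, n in counts.most_common(limit)
    ]
```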
GET /{org}/{repo}/metric?metric={metric}
Returns a given custom metric for a repo.
Parameters:
- org: String - the org for the repo. Required.
- repo: String - the name of the repo. Required.
- metric: String - the custom metric to return. Required.
Response:
{ "repoName": "russorat/savage-leads", "metricName": "repository.watchers", "metric": 0 }
GET /{messageId}
Returns the raw message given a message id.
Parameters:
- messageId: String - the GitHub message id to return. Can be found using the messages endpoint. Required.
GitHub Dataset
Example message:
{ "ref": "refs/heads/testbranch", "before": "0000000000000000000000000000000000000000", "after": "6d6db4855be89fb10f5b09a214a20b6125cd7be8", "created": true, "deleted": false, "forced": true, "base_ref": "refs/heads/master", "compare": "https://github.com/russorat/savage-leads/compare/testbranch", "commits": [], ... }
GET /{org}/{repo}/messages/{messageType}?startTime={startTime}&endTime={endTime}&limit={limit}&offset={offset}
Returns a list of message ids for the given repo and message type.
Parameters:
- org: String - the org for the repo. Required.
- repo: String - the name of the repo. Required.
- messageType: String - the type of message to search (push, issue, pull_request, etc.). Required.
- startTime: long - the start time as a unix timestamp in seconds. Optional, default 0.
- endTime: long - the end time as a unix timestamp in seconds. Optional, default now.
- limit: int - the number of results to return. Optional, default 10.
- offset: int - the offset used for paging. Optional, default 0.
Response:
{ "totalMessages": 1, "messageIds": [ "132e1700-efa8-11e5-844f-7d105e7a1526" ] }
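The time window and paging parameters of this endpoint interact as follows: filter by time first, then apply `offset` and `limit` to the filtered list. The sketch below illustrates that order under a few assumptions that the source does not state: both time bounds are inclusive, results are newest first, and `totalMessages` counts all matches rather than the page size. The `messages` input is a hypothetical list of `(timestamp_seconds, message_id)` pairs.

```python
import time


def list_message_ids(messages, start_time=0, end_time=None, limit=10, offset=0):
    """Filter by time window, then page through the matches.

    `messages` is a hypothetical list of (timestamp_seconds, message_id) pairs.
    Assumptions: inclusive bounds, newest first, totalMessages = all matches.
    """
    if end_time is None:
        end_time = int(time.time())  # default endTime is "now"
    in_range = sorted(
        (m for m in messages if start_time <= m[0] <= end_time),
        key=lambda m: m[0],
        reverse=True,
    )
    page = in_range[offset:offset + limit]
    return {
        "totalMessages": len(in_range),
        "messageIds": [message_id for _, message_id in page],
    }
```

A client can fetch the next page by increasing `offset` by `limit` until `offset` reaches `totalMessages`.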
GitHub Raw Dataset
- Dataset to store the raw messages captured from GitHub
- Key is the X-GitHub-Delivery header of the message
- The table has three columns, one for the messageId (String), one for the messageType (String), and one for the jsonPayload (String)
- This table is RecordScannable so the data can be viewed in the UI.
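As a rough illustration of the raw table layout, the sketch below turns one incoming webhook request into a row key and three columns. It assumes that the message type is taken from the `X-GitHub-Event` header that GitHub sends with each webhook delivery and that `messageId` repeats the delivery id used as the key; neither detail is stated in this page, and the function name is hypothetical.

```python
def raw_record(headers, body):
    """Build (row_key, columns) for the raw messages table.

    `headers` is a dict of HTTP headers and `body` the JSON payload as text.
    Assumption: messageType comes from X-GitHub-Event, and messageId is the
    same value as the X-GitHub-Delivery key.
    """
    delivery_id = headers["X-GitHub-Delivery"]
    columns = {
        "messageId": delivery_id,
        "messageType": headers["X-GitHub-Event"],
        "jsonPayload": body,
    }
    return delivery_id, columns
```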
GitHub Parsed Dataset
- Dataset will contain a Table to hold the parsed messages.
- The JSON message is first flattened, and then each value is inserted as a column in the Table. A final field called rawPayload is also written to capture the full payload. Additional columns for eventId and messageType are also added.
- The key to the table will be <fullRepoName>-<messageType>-<timestampInSeconds><inverseTimestampInSeconds>-<X-GitHub-Delivery>. This will allow scanning by message and by time with the most recent messages returned first.
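The flattening step and the time-ordered key can be sketched as follows. Several details here are assumptions rather than part of the design: column names join nested keys with dots, empty lists and objects produce no columns, the inverse timestamp is a fixed maximum (`MAX_SECONDS`, hypothetical) minus the event time, zero-padded so that string order matches numeric order, the key carries only the inverse timestamp as its time component, `eventId` holds the delivery id, and the repo name is read from `repository.full_name` in the payload.

```python
import json

MAX_SECONDS = 9_999_999_999  # hypothetical upper bound used to invert timestamps


def flatten(value, prefix=""):
    """Flatten nested JSON into {dotted.path: scalar} pairs.

    Empty lists and objects contribute no columns in this sketch.
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        flat[prefix[:-1]] = value  # drop the trailing dot
    return flat


def row_key(full_repo_name, message_type, timestamp_seconds, delivery_id):
    """Build a key that sorts newer messages first within repo and type."""
    inverse = MAX_SECONDS - timestamp_seconds
    return f"{full_repo_name}-{message_type}-{inverse:010d}-{delivery_id}"


def parsed_row(raw_json, message_type, delivery_id, timestamp_seconds):
    """Build (row_key, columns) for the parsed messages table."""
    payload = json.loads(raw_json)
    columns = flatten(payload)
    columns["rawPayload"] = raw_json
    columns["eventId"] = delivery_id  # assumption: eventId is the delivery id
    columns["messageType"] = message_type
    repo = payload["repository"]["full_name"]
    return row_key(repo, message_type, timestamp_seconds, delivery_id), columns
```

Because the inverse timestamp shrinks as the event time grows, an ascending scan over a `<fullRepoName>-<messageType>-` prefix returns the most recent messages first.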
GitHub Metrics
- Data is stored in a Cube dataset
- The Cube will have the following properties:
  - Resolutions (in seconds): 60, 3600, 86400, 604800
  - Dimensions:
    - repository
    - message_type
    - repository, message_type
    - sender
    - sender, message_type
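To make the resolutions and dimension groups concrete, here is a small in-memory stand-in for the Cube. It is only a sketch of the roll-up idea and does not use the real Cube dataset API; the class and method names are hypothetical. Each fact is counted once per resolution (rounded down to the start of its time bucket) and once per dimension group.

```python
from collections import defaultdict

RESOLUTIONS = (60, 3600, 86400, 604800)  # minute, hour, day, week in seconds
AGGREGATIONS = (
    ("repository",),
    ("message_type",),
    ("repository", "message_type"),
    ("sender",),
    ("sender", "message_type"),
)


class TinyCube:
    """Hypothetical in-memory illustration of the cube roll-up."""

    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, timestamp, dimensions, value=1):
        """Record one fact in every resolution bucket and dimension group."""
        for resolution in RESOLUTIONS:
            bucket = timestamp - timestamp % resolution
            for names in AGGREGATIONS:
                group = tuple((name, dimensions[name]) for name in names)
                self.counts[(resolution, bucket, group)] += value

    def query(self, resolution, timestamp, **dims):
        """Read the count for the bucket containing `timestamp`.

        Raises StopIteration if `dims` matches no configured group.
        """
        names = next(a for a in AGGREGATIONS if set(a) == set(dims))
        group = tuple((name, dims[name]) for name in names)
        bucket = timestamp - timestamp % resolution
        return self.counts.get((resolution, bucket, group), 0)
```

Storing a pre-aggregated count per group trades extra writes for cheap reads: a query for a sender or a repository is a single lookup rather than a scan over raw messages.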
...