This page documents the plan for redesigning GitHub statistics collection in the Caskalytics app.
...
Current Implementation of GitHub Metrics
Use a Workflow Custom Action to run periodic RESTful calls to the GitHub API.
Results will be written into the GitHub partition of the Fileset.
A MapReduce job will periodically read from the GitHub partition of the Fileset, and update the Cube dataset.
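The polling step above could look roughly like the following sketch. It assumes the public GitHub REST endpoint `https://api.github.com/repos/{org}/{repo}` and keeps only the per-repository counters listed later on this page. The helper names (`fetch_repo`, `extract_repo_stats`, `to_partition_record`) and the JSON-line record layout are hypothetical and are not taken from the Caskalytics code.

```python
import json
import urllib.request

# Counters named in this page's "Per Repository" list.
# Assumption: the API field "size" corresponds to "repo size".
STAT_FIELDS = ("size", "stargazers_count", "watchers_count", "forks_count")


def fetch_repo(org, repo):
    """Fetch repository metadata from the public GitHub REST API."""
    url = f"https://api.github.com/repos/{org}/{repo}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def extract_repo_stats(repo_json):
    """Keep only the counters of interest; missing fields default to 0."""
    return {field: repo_json.get(field, 0) for field in STAT_FIELDS}


def to_partition_record(org, repo, repo_json, fetched_at):
    """Serialize one poll result as a JSON line (hypothetical record layout)."""
    record = {"org": org, "repo": repo, "fetched_at": fetched_at}
    record.update(extract_repo_stats(repo_json))
    return json.dumps(record, sort_keys=True)
```

A scheduled job would call `fetch_repo` for each tracked repository and append the result of `to_partition_record` to the GitHub partition. How the partition is actually written is outside the scope of this sketch.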
...
- Metrics will be stored in a separate dataset from the raw messages
- Repo messages will be overwritten each time a new message is received from GitHub
- Per User metrics will be incremented
- All Time Metrics
- Per Repository
- repo size
- stargazers_count
- watchers_count
- forks_count
- Per User
- Issues <action> (opened, closed, reopened)
- Issue Comment Created
- Message
- count
- Per Repository
- Per Repo / Per Message
- count
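The two update rules in this list (repository values are overwritten, per-user values are incremented) can be illustrated with a small in-memory model. This is only a sketch of the semantics. The `MetricsStore` class and its method names are hypothetical, and the real metrics would live in the dataset described on this page, not in Python dictionaries.

```python
from collections import Counter

# Per-repository fields named in the list above.
REPO_FIELDS = ("size", "stargazers_count", "watchers_count", "forks_count")


class MetricsStore:
    """In-memory stand-in for the metrics dataset (illustration only)."""

    def __init__(self):
        self.repo_stats = {}          # repo name -> latest snapshot
        self.user_counts = Counter()  # (user, event, action) -> running total

    def record_repo_snapshot(self, repo, payload):
        # Overwrite: the newest message replaces the previous snapshot.
        self.repo_stats[repo] = {f: payload.get(f, 0) for f in REPO_FIELDS}

    def record_user_event(self, user, event, action):
        # Increment: every message adds one to the running total.
        self.user_counts[(user, event, action)] += 1
```

For example, two snapshots for the same repository leave only the second one in `repo_stats`, while two "opened" issue events from the same user leave a count of 2 in `user_counts`.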
Capture Endpoint
- The capture endpoint will be a catch all endpoint where the
...
- that accepts POST messages from GitHub, verifies their authenticity, and writes the message to the data store.
- Each message should have the following headers to be considered "valid"
- User-Agent should start with GitHub-Hookshot/<id>
- X-GitHub-Delivery should be a UUID for the message
- X-GitHub-Event should be the name of the message
- X-Hub-Signature should contain the HMAC SHA-1 digest of the message (prefixed with sha1=) for verification
- payload should be the JSON message
- If any required headers are missing or invalid, the response will be UNAUTHORIZED with a message stating that the caller is not authorized to call the service.
- If the Event is missing, a BAD_REQUEST is returned.
- If there is no payload, a BAD_REQUEST is returned.
- If the payload digest does not match the one provided in the Signature header, or there is an error generating it, a BAD_REQUEST is returned.
- When everything is successful, an OK is returned with a message stating that the request was processed successfully.
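The validation rules above can be sketched as a single function that maps a request to a status code. The sketch makes several assumptions: headers arrive as a plain dict with exact-case names, the webhook secret is available as bytes, the header checks run in the order shown, and the function name `validate_message` is hypothetical. The signature check uses HMAC SHA-1 over the raw body, which is how GitHub computes X-Hub-Signature.

```python
import hashlib
import hmac
import uuid
from http import HTTPStatus


def _is_uuid(value):
    """Return True if value parses as a UUID string."""
    try:
        uuid.UUID(value)
        return True
    except (ValueError, AttributeError, TypeError):
        return False


def validate_message(headers, body, secret):
    """Return (HTTPStatus, message) for one incoming webhook POST.

    headers: dict of header name -> value (exact-case keys, for brevity)
    body:    raw request body as bytes
    secret:  shared webhook secret as bytes
    """
    user_agent = headers.get("User-Agent", "")
    signature = headers.get("X-Hub-Signature", "")
    if (
        not user_agent.startswith("GitHub-Hookshot/")
        or not _is_uuid(headers.get("X-GitHub-Delivery"))
        or not signature.startswith("sha1=")
    ):
        return HTTPStatus.UNAUTHORIZED, "Not authorized to call this service"
    if not headers.get("X-GitHub-Event"):
        return HTTPStatus.BAD_REQUEST, "Missing event name"
    if not body:
        return HTTPStatus.BAD_REQUEST, "Missing payload"
    try:
        expected = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()
    except TypeError:
        # Covers "an error generating it", e.g. a secret that is not bytes.
        return HTTPStatus.BAD_REQUEST, "Could not compute payload digest"
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return HTTPStatus.BAD_REQUEST, "Payload digest mismatch"
    return HTTPStatus.OK, "Message processed successfully"
```

The digest must be computed over the raw bytes of the request body, before any JSON parsing, because re-serializing the payload can change whitespace or key order and break the comparison.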
Metrics Endpoints
- These will be REST endpoints used to get repo stats for Caskalytics
| Endpoint | Description | Parameter | Parameter description | Required? |
|---|---|---|---|---|
| /{org}/{repo}/stats | Returns the stats of the given repo | org | String - the org for the repo | Yes |
| | | repo | String - the name of the repo | Yes |
| /{org}/{repo}/messages/{messageType} | Returns the messages for a given repo | org | String - the org for the repo | Yes |
| | | repo | String - the name of the repo | Yes |
| | | messageType | String - the type of message to return | Yes |
| | | startTime | Start time to search for, in seconds. Defaults to 0 | No |
| | | endTime | End time to search for, in seconds. Defaults to now | No |
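As an illustration of the optional time parameters, the sketch below applies the documented defaults (startTime 0, endTime now) and builds a request path for the messages endpoint. It assumes that startTime and endTime are passed as query parameters, which this page does not state. Both function names are hypothetical.

```python
import time
from urllib.parse import quote, urlencode


def resolve_time_range(start_time=None, end_time=None, now=None):
    """Apply the documented defaults: startTime -> 0, endTime -> now (seconds)."""
    if now is None:
        now = int(time.time())
    start = 0 if start_time is None else int(start_time)
    end = now if end_time is None else int(end_time)
    return start, end


def messages_path(org, repo, message_type, start_time=None, end_time=None):
    """Build the path for /{org}/{repo}/messages/{messageType}.

    Assumption: the optional times travel as query parameters.
    """
    parts = (org, repo, "messages", message_type)
    path = "/" + "/".join(quote(part, safe="") for part in parts)
    query = {}
    if start_time is not None:
        query["startTime"] = int(start_time)
    if end_time is not None:
        query["endTime"] = int(end_time)
    return path + ("?" + urlencode(query) if query else "")
```

A server-side handler would call `resolve_time_range` on whatever the client supplied, so that a request with no time parameters searches from 0 up to the current time.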
GitHub Dataset
- The dataset will contain two stores: a Table to hold the raw messages and a Cube to hold the metrics.
- As the raw data is written to the Table store, the metrics in the Cube will be updated as needed.
- The JSON message is first flattened, and then each value is inserted as a column in the Table. A final field called rawPayload is also written to capture the full payload.
- The key to the table will be <messageType>-<timestampInSeconds>-<X-GitHub-Delivery>. This will allow scanning by message and by time.
- The Cube will have the following properties
- Resolutions (in seconds): 60, 3600, 86400, 604800 (minute, hour, day, week)
- Dimensions:
- repository
- message_type
- repository, message_type
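The Table layout and the Cube resolutions described above can be sketched as follows. The flattening scheme (nested keys joined with "." and list items addressed by index) is an assumption, since this page does not specify how column names are derived. The function names are hypothetical. `bucket_start` shows how a timestamp would be aligned to one of the listed resolutions.

```python
import json

# Cube resolutions from this page, in seconds: minute, hour, day, week.
RESOLUTIONS = (60, 3600, 86400, 604800)


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into a single-level dict of column -> value.

    Assumed naming: nested keys joined with ".", list items by index.
    Empty dicts and lists produce no columns.
    """
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}
    columns = {}
    for key, child in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        columns.update(flatten(child, name))
    return columns


def row_key(message_type, timestamp_seconds, delivery_id):
    """Build <messageType>-<timestampInSeconds>-<X-GitHub-Delivery>."""
    return f"{message_type}-{int(timestamp_seconds)}-{delivery_id}"


def to_row(message_type, timestamp_seconds, delivery_id, raw_payload):
    """Return (key, columns) for one message, including rawPayload."""
    columns = flatten(json.loads(raw_payload))
    columns["rawPayload"] = raw_payload
    return row_key(message_type, timestamp_seconds, delivery_id), columns


def bucket_start(timestamp_seconds, resolution):
    """Align a timestamp to the start of its aggregation bucket."""
    return timestamp_seconds - (timestamp_seconds % resolution)
```

Because the key starts with the message type followed by the timestamp, a prefix scan on `<messageType>-` returns one message type in time order. This ordering relies on timestamps having the same number of digits, which holds for epoch seconds from 2001 through 2286.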