This page documents the plan for redesigning GitHub statistics collection in the Caskalytics app.
...
Current Implementation of GitHub Metrics
- Use a Workflow Custom Action to run periodic RESTful calls to the GitHub API (a sketch of this poll follows).
- Results are written into the GitHub partition of the Fileset.
- A MapReduce job periodically reads from the GitHub partition of the Fileset and updates the Cube dataset.
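For reference, a minimal sketch of the kind of periodic poll the custom action performs, using only java.net. The org name, token handling, and class name are illustrative assumptions, not the app's actual code:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class GithubPoller {
    // Hypothetical org name; the real app would read this from configuration.
    private static final String ORG = "example-org";

    /** Fetch the org's repository list from the GitHub REST API as a JSON string. */
    public static String fetchOrgRepos(String token) throws Exception {
        URL url = new URL("https://api.github.com/orgs/" + ORG + "/repos");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/vnd.github+json");
        conn.setRequestProperty("Authorization", "Bearer " + token);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        // The response JSON is what gets written into the GitHub partition of the Fileset.
        return body.toString();
    }
}
```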
...
- Expose a service that accepts webhook messages from GitHub, verifies their signatures, and writes the messages to a Table dataset (a signature-verification sketch follows this list).
- The service will populate both a raw-message store and a metrics table that aggregates stats at the repo and user level
- Expose a RESTful endpoint that queries metrics from the aggregates table and returns results as JSON
- Use the data service to drive a visual display (e.g., a dashboard) of the collected metrics.
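GitHub signs each webhook delivery with an HMAC of the raw payload, computed with the shared secret configured on the hook and delivered in the X-Hub-Signature-256 header as "sha256=<hex digest>". A minimal verification sketch; the class and method names are illustrative, not the app's actual API:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class WebhookVerifier {
    /**
     * Returns true if the X-Hub-Signature-256 header matches the HMAC-SHA256
     * of the raw request body, computed with the webhook's shared secret.
     */
    public static boolean isValid(String signatureHeader, byte[] body, String secret)
            throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
        byte[] digest = mac.doFinal(body);
        StringBuilder hex = new StringBuilder("sha256=");
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        // Constant-time comparison to avoid leaking signature prefixes.
        return MessageDigest.isEqual(
                hex.toString().getBytes(StandardCharsets.UTF_8),
                signatureHeader.getBytes(StandardCharsets.UTF_8));
    }
}
```

Deliveries that fail this check would be rejected before anything is written to the dataset.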
Additional Value
- Because we collect each message as it happens in GitHub, we get a near-real-time overview of what's happening in our Org (data updates faster)
- Configuring the webhook at the Org level means we don't have to configure it for each repo; as new repos are added, their data will start showing up in our metrics (decreased maintenance)
- Since we have the raw messages stored, we can reprocess them at any time to extract additional metrics that we may not know we need yet (future-proof)
- No additional MapReduce job is needed, since metrics are collected as messages arrive (simplify)
Metrics Calculated
- Metrics will be stored in a separate dataset from the raw messages (a key sketch follows the list below)
- Repo-level values will be overwritten each time a new message is received from GitHub
- All Time Metrics
  - Per Repository
    - repo size
    - stargazers_count
    - watchers_count
    - forks_count
    - total pull requests
  - Per Message
    - count
  - Per Repo / Per Message
    - count
  - Per Sender
    - Per Repository
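To illustrate how the count-style metrics above might be keyed, the sketch below derives one counter key per aggregation level from a single webhook delivery. MetricsStore is a hypothetical stand-in for the metrics dataset, and the key format is an assumption, not the app's actual schema:

```java
import java.util.Arrays;
import java.util.List;

public class MetricKeys {
    /** Hypothetical stand-in for the metrics dataset's increment operation. */
    public interface MetricsStore {
        void increment(String rowKey, long amount);
    }

    /**
     * For one webhook delivery, bump the all-time counters at each level:
     * per message type, per repo/message type, and per sender/repo.
     */
    public static void recordEvent(MetricsStore store, String repo,
                                   String messageType, String sender) {
        List<String> keys = Arrays.asList(
                "type:" + messageType,                    // Per Message count
                "repo:" + repo + ":type:" + messageType,  // Per Repo / Per Message count
                "sender:" + sender + ":repo:" + repo);    // Per Sender / Per Repository
        for (String key : keys) {
            store.increment(key, 1L);
        }
    }
}
```

Snapshot values such as repo size or stargazers_count would instead be written with a plain put, matching the overwrite behavior noted above.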
...