GitHub Batch Source
- Anton Kai
- Prateek Duble
GitHub provides hosting for software development and version control using Git. This plugin would allow users to select a dataset associated with a specified repository and collect its raw data.
User Expectations
- Users would like to collect raw datasets associated with a specific repository so that they can monitor and report on it
- Users would like to perform aggregations on GitHub datasets so that they can better understand repository usage
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
User Configurations
User Configuration Label | Label Description | Variable | User Widget | Notes |
---|---|---|---|---|
Access Token | Authorization token to be used to authenticate to GitHub API | authorizationToken | Text Box | https://developer.github.com/v3/#authentication |
Repository name | Repository name from which the data is retrieved | repoName | Text Box | |
Repository owner name | GitHub username who owns the repository from which the data is retrieved | repoOwner | Text Box | |
GitHub API hostname | GitHub API hostname from which the data is retrieved. | hostname | Text Box | Optional, for GitHub Enterprise only. By default, api.github.com |
Dataset* | Dataset name that you would like to retrieve** | dataset_name | Drop down | https://developer.github.com/v3/repos/ Valid values include all the objects listed in the above link. |
* Dataset name can be one of the following: Branches, Collaborators, Comments, Commits, Contents, Deploy Keys, Deployments, Forks, Invitations, Pages, Releases, Traffic:Referrers, Webhooks.
** Retrieving GitHub data always calls the list API for the selected object. For instance, if the Collaborators dataset is selected, the plugin gets the list of all collaborators on the specified repository (along with the other fields returned by the List Collaborators API).
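The following is a minimal sketch of how the configuration values above could be turned into a list API URL. The `LIST_ENDPOINTS` table and the `build_list_url` helper are hypothetical names introduced for illustration, and the mapping is partial: it covers only datasets whose REST v3 list paths are assumed here from the GitHub documentation. A GitHub Enterprise host may also need an extra path prefix, which this sketch does not handle.

```python
from urllib.parse import quote

# Partial, assumed mapping from dataset name to the path suffix of its
# list endpoint under /repos/{owner}/{repo}/ in the GitHub REST API v3.
LIST_ENDPOINTS = {
    "Branches": "branches",
    "Collaborators": "collaborators",
    "Commits": "commits",
    "Deploy Keys": "keys",
    "Forks": "forks",
    "Releases": "releases",
    "Traffic:Referrers": "traffic/popular/referrers",
    "Webhooks": "hooks",
}


def build_list_url(repo_owner, repo_name, dataset_name, hostname="api.github.com"):
    """Build the list API URL for the selected dataset (hypothetical helper)."""
    try:
        suffix = LIST_ENDPOINTS[dataset_name]
    except KeyError:
        raise ValueError(f"Unsupported dataset: {dataset_name!r}") from None
    # Percent-encode owner and repository name so they are safe as path segments.
    owner = quote(repo_owner, safe="")
    repo = quote(repo_name, safe="")
    return f"https://{hostname}/repos/{owner}/{repo}/{suffix}"
```

For example, `build_list_url("octocat", "hello-world", "Collaborators")` yields the URL of the List Collaborators call for that repository.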
Design / Implementation Tips
Authentication will be performed using an access token.
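As an illustration only, token authentication could be sketched as below. The `build_request` helper is a hypothetical name; the sketch assumes the token is sent in an `Authorization: token ...` header, which is one of the forms described in the GitHub API v3 authentication notes linked in the table above. It builds the request without sending it.

```python
import urllib.request


def build_request(url, authorization_token):
    """Build an authenticated GET request for the GitHub API (hypothetical helper)."""
    return urllib.request.Request(
        url,
        headers={
            # Assumed header form for personal access tokens in API v3.
            "Authorization": f"token {authorization_token}",
            # Ask explicitly for the v3 JSON media type.
            "Accept": "application/vnd.github.v3+json",
        },
    )
```

The returned object can be passed to `urllib.request.urlopen` to perform the call; the token itself should come from the Access Token configuration field rather than being hard-coded.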
Output schema must be automatically generated from the selected dataset.
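One possible way to generate the schema is to infer it from a sample of the records returned by the list API. The sketch below assumes that approach; `infer_type` and `infer_schema` are hypothetical helpers, and the type names are placeholders rather than the pipeline's actual schema types.

```python
def infer_type(value):
    """Map a decoded JSON value to a placeholder schema type name."""
    if value is None:
        return "null"
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "record"


def infer_schema(records):
    """Collect, per field, the sorted set of types seen across all records."""
    seen = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(infer_type(value))
    return {field: sorted(types) for field, types in seen.items()}
```

A field that appears with both a concrete type and `null` would then be treated as nullable in the output schema. Inferring from a sample can miss fields that are absent from the sampled records, so a fixed schema per dataset is an alternative worth considering.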
References
- GitHub API v3 documentation: https://developer.github.com/v3/
- GitHub API v4 (GraphQL) documentation: https://developer.github.com/v4/
- GitHub API v3 third-party libraries: https://developer.github.com/v3/libraries/
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature