A Look Into GitHub / GitLab Repository Quality Metrics
Abe Megahed, University of Wisconsin Data Science Institute
Summary
As part of an investigation by the University of Wisconsin-Madison’s Open Source Program Office (OSPO) into open source activity around the state, we collected and analyzed repository metadata from GitHub and GitLab. The results offer some insight into the general state of code and repository quality.
Methodology
The UW OSPO opened in the fall of 2023, and one of our first tasks was to understand the current state of open source activity at the University of Wisconsin and around the state (in accordance with the Wisconsin Idea). To aid in this task, we built a set of scripts that use the GitHub and GitLab APIs to download and compile metadata about relevant repositories.
- For the GitHub repositories, we simply looked for repositories that contained the term “Wisconsin” somewhere in the description. This yielded 3033 repositories to examine.
- For GitLab projects, we were able to examine every project on the GitLab instance that the University of Wisconsin hosts and encourages students and staff to use. This yielded 2579 projects.
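The GitHub selection step can be sketched as a query against GitHub's repository search endpoint using the `in:description` qualifier. This is a minimal illustration under stated assumptions, not the OSPO's actual script: the helper names `build_search_url` and `summarize_item` are hypothetical, and the code only builds the request URL and trims a result item, so it runs without network access.

```python
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(term, page=1, per_page=100):
    """Build a search URL for repositories whose description contains `term`.

    Hypothetical helper for illustration only.
    """
    params = {
        "q": f"{term} in:description",  # restrict the match to the description field
        "per_page": per_page,           # 100 is the largest page size GitHub accepts
        "page": page,
    }
    return f"{GITHUB_SEARCH_URL}?{urlencode(params)}"


def summarize_item(item):
    """Keep only the fields of interest from one search result item."""
    return {
        "full_name": item.get("full_name"),
        "description": item.get("description"),
        "homepage": item.get("homepage"),
        # `license` is null in the API response when GitHub detects no license.
        "license": (item.get("license") or {}).get("spdx_id"),
    }


if __name__ == "__main__":
    print(build_search_url("Wisconsin"))
```

GitHub's search API returns at most 1,000 results per query, so collecting a few thousand repositories would presumably require splitting the search into several narrower queries, for example by creation date. How the original scripts handled this is not described here.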
The code and findings from this investigation are available at the following URL: https://github.com/UW-Madison-DSI/UW-Open-Source-Exploration
Findings
The most important thing to note in the data was actually what was missing from most repositories. These results are summarized in the following charts: https://projects.dsi.wisc.edu/amegahed/open-source
We found the following GitHub and GitLab statistics:
- Only 72% of GitHub repositories and 84% of GitLab projects included a README file.
- Only 14% of GitHub repositories and 10% of GitLab projects included a LICENSE file.
- Only 5% of repositories included a homepage. More mature projects typically have a homepage aimed at the actual users of the software rather than only at developers.
- Only 8% of GitHub repositories and 0.3% of GitLab projects included a README containing images. Although graphics and illustrations are not strictly required in a README, they are a fairly reliable indicator of whether the authors have considered the needs of their audience.
- 80% of GitHub repositories and 50% of GitLab projects included a basic description.
- Only 0.5% of GitHub repositories (and essentially no GitLab projects) included all of these essential elements!
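The checks behind these percentages can be sketched as follows. This is an illustrative sketch rather than the analysis code itself: the record fields (`readme_text`, `license`, `homepage`, `description`) are assumed names for metadata already downloaded, and the image test is a simple heuristic that looks for Markdown image syntax or an HTML `<img>` tag.

```python
import re

# Heuristic: Markdown image syntax ![alt](url) or an HTML <img> tag.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]+\)|<img\b", re.IGNORECASE)


def check_elements(repo):
    """Report which of the five elements a repository record contains.

    `repo` is a dict with hypothetical keys; missing keys count as absent.
    """
    readme = repo.get("readme_text") or ""
    return {
        "readme": bool(readme.strip()),
        "license": bool(repo.get("license")),
        "homepage": bool((repo.get("homepage") or "").strip()),
        "readme_images": bool(IMAGE_PATTERN.search(readme)),
        "description": bool((repo.get("description") or "").strip()),
    }


def percent_with_all(repos):
    """Percentage of records that contain every element."""
    if not repos:
        return 0.0
    complete = sum(all(check_elements(r).values()) for r in repos)
    return 100.0 * complete / len(repos)


if __name__ == "__main__":
    sample = [
        {"readme_text": "# Demo\n![screenshot](docs/shot.png)", "license": "MIT",
         "homepage": "https://example.org", "description": "A demo project"},
        {"readme_text": "# Notes", "description": "Course notes"},
    ]
    for record in sample:
        print(check_elements(record))
    print(percent_with_all(sample))
```

Requiring all five elements at once is a strict test, which is one reason the combined figure is so much lower than any individual percentage.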


Proposed Remediations
We have been considering several ways to address the code quality problem and to help people create better repositories.
- Educational Materials
We can create a set of instructional materials that would describe the importance of each of the repository elements and serve as a guide for building quality repositories.
- Curated Examples
Sometimes people learn best by example so the OSPO is working to build a curated set of showcase repositories that can demonstrate what a high quality and complete repository looks like in practice.
- Templates
There are a number of examples of README templates, but there are differences of opinion on the essential elements and preferred formatting. We could create our own OSPO README template that reflects our own thoughts and experience.
- Classes or Workshops
The OSPO could potentially hold classes or workshops in repository creation. The University of Wisconsin’s Data Science Hub also conducts classes on how to use Git which could potentially include a section about repository quality.
- Automated Repository Evaluation Tools
We could potentially create an automated evaluation tool that would examine a particular repository and make suggestions for improvement. This might also be an opportunity to explore using AI tools to parse the README text to help with the process of creating meaningful automated suggestions.
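As a rough idea of what a first version of such an evaluation tool might look like, the sketch below inspects a local checkout and returns plain-language suggestions. Everything here is an assumption made for illustration: the function names, the file-name conventions it looks for (`README*`, `LICENSE*`, `COPYING*`), and the wording of the suggestions are hypothetical, and no AI-based analysis is attempted.

```python
from pathlib import Path


def find_file(repo_dir, stem):
    """Return the first top-level file whose name starts with `stem`, ignoring case."""
    for path in sorted(Path(repo_dir).iterdir()):
        if path.is_file() and path.name.lower().startswith(stem):
            return path
    return None


def suggest_improvements(repo_dir):
    """Return a list of suggestions for a local repository checkout."""
    suggestions = []

    readme = find_file(repo_dir, "readme")
    if readme is None:
        suggestions.append("Add a README that explains what the project does and how to use it.")
    else:
        text = readme.read_text(encoding="utf-8", errors="replace")
        # Simple heuristic for Markdown or HTML images in the README.
        if "![" not in text and "<img" not in text.lower():
            suggestions.append("Consider adding a screenshot or diagram to the README.")

    # Some projects name their license file COPYING instead of LICENSE.
    if find_file(repo_dir, "license") is None and find_file(repo_dir, "copying") is None:
        suggestions.append("Add a LICENSE file so others know how they may use the code.")

    return suggestions


if __name__ == "__main__":
    for line in suggest_improvements("."):
        print("-", line)
```

Run from the root of a checkout, the script prints one line per missing element and nothing when all of its checks pass. The homepage and description live in the hosting platform's metadata rather than in the files, so checking them would need the GitHub or GitLab API instead.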
Conclusions
Because so many repositories lack even the most basic elements, the UW Open Source Program Office has an opportunity to make a significant difference in the quality of current repositories.