As part of an investigation by the University of Wisconsin-Madison’s Open Source Program Office (OSPO) into open source activity around the state, we collected and analyzed repository metadata from GitHub and GitLab. The results offer some insight into the general state of code and repository quality.
The UW OSPO opened in the fall of 2023, and one of our first tasks was to understand the current state of open source activity at the University of Wisconsin and around the state (in accordance with the Wisconsin Idea). To aid in this task, we built a set of scripts that use the GitHub and GitLab APIs to download and compile metadata about relevant repositories.
For GitHub, we simply searched for repositories that contained the term “Wisconsin” somewhere in the description. This yielded 3033 repositories to examine.
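A description search of this kind can be sketched against GitHub's public repository search endpoint. The snippet below is only an illustration, not the OSPO's actual script: the helper names `build_search_url` and `summarize` are hypothetical, and the fields kept in `summarize` are our own choice. Note that the search API returns at most 1,000 results per query, so collecting a few thousand repositories would require splitting the query somehow (for example by creation date).

```python
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(term: str, page: int = 1, per_page: int = 100) -> str:
    """Build a repository-search URL restricted to the description field.

    Hypothetical helper: the "in:description" qualifier limits matches
    to the repository description.
    """
    params = {
        "q": f"{term} in:description",
        "per_page": per_page,  # 100 is the maximum page size
        "page": page,
    }
    return f"{GITHUB_SEARCH_URL}?{urlencode(params)}"


def summarize(repo: dict) -> dict:
    """Keep a few metadata fields from one item of a search response.

    The selection of fields is an assumption made for illustration.
    """
    return {
        "full_name": repo.get("full_name"),
        "description": repo.get("description"),
        # "license" is null in the API response when none is detected.
        "license": (repo.get("license") or {}).get("spdx_id"),
        "stars": repo.get("stargazers_count", 0),
    }
```

Fetching `build_search_url("Wisconsin")` with any HTTP client and passing each element of the response's `items` array to `summarize` would produce one compact record per repository.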
For GitLab, the University of Wisconsin hosts its own instance, which students and staff are encouraged to use, so we were able to examine every project on it. This yielded 2579 projects.
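Listing every project on a self-hosted GitLab instance amounts to paging through the `/api/v4/projects` endpoint. The following is a minimal sketch under stated assumptions: the host `gitlab.example.edu` is a placeholder rather than the university's real address, and the helpers `projects_url` and `next_page` are hypothetical names, not taken from the OSPO code.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def projects_url(base_url: str, page: int = 1, per_page: int = 100) -> str:
    """Build the URL for one page of the GitLab project listing."""
    params = {"per_page": per_page, "page": page}
    return f"{base_url.rstrip('/')}/api/v4/projects?{urlencode(params)}"


def next_page(headers: Mapping[str, str]) -> Optional[int]:
    """Read the next page number from GitLab's pagination header.

    GitLab leaves X-Next-Page empty on the last page. A plain dict is
    case-sensitive, so real code should use the case-insensitive header
    mapping provided by its HTTP client.
    """
    value = headers.get("X-Next-Page", "")
    return int(value) if value.strip() else None


# Placeholder host, for illustration only.
first_page = projects_url("https://gitlab.example.edu")
```

A harvesting loop would request `first_page`, usually with an access token in the `PRIVATE-TOKEN` header, and keep following `next_page` until it returns `None`.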
The code and findings from this investigation are available at the following URL: https://github.com/UW-Madison-DSI/UW-Open-Source-Exploration
The most notable finding in the data was what most repositories were missing. These results are summarized in the following charts: https://projects.dsi.wisc.edu/amegahed/open-source
We found the following GitHub statistics:
We have been considering several ways to address the code quality problem and to help people create better repositories.
We could create a set of instructional materials that describe the importance of each repository element and serve as a guide for building quality repositories.
Sometimes people learn best by example, so the OSPO is working to build a curated set of showcase repositories that demonstrate what a complete, high-quality repository looks like in practice.
There are a number of examples of README templates, but there are differences of opinion on the essential elements and preferred formatting. We could create our own OSPO README template that reflects our own thoughts and experience.
The OSPO could hold classes or workshops on repository creation. The University of Wisconsin’s Data Science Hub also conducts classes on how to use Git, which could include a section about repository quality.
We could potentially create an automated evaluation tool that would examine a particular repository and make suggestions for improvement. This might also be an opportunity to explore using AI tools to parse the README text to help with the process of creating meaningful automated suggestions.
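As a rough idea of what the simplest version of such a tool might look like, the sketch below checks the top level of a local checkout for a few common files and returns a suggestion for each one that is missing. Both the checklist and the wording of the suggestions are assumptions made for illustration, not an official OSPO list, and `evaluate_repository` is a hypothetical name.

```python
from pathlib import Path

# Hypothetical checklist: file stem (lowercase) -> suggestion if absent.
CHECKLIST = {
    "readme": "Add a README that explains what the project does and how to use it.",
    "license": "Add a LICENSE file so others know how they may reuse the code.",
    "contributing": "Add a CONTRIBUTING guide describing how to propose changes.",
}


def evaluate_repository(root):
    """Return suggestions for checklist files missing from the top level of root."""
    # Compare stems case-insensitively so README.md, Readme.rst and
    # LICENSE.txt all count as present.
    present = {p.stem.lower() for p in Path(root).iterdir() if p.is_file()}
    return [tip for stem, tip in CHECKLIST.items() if stem not in present]
```

Calling `evaluate_repository(".")` inside a checkout returns an empty list when all three files exist. Judging the content of a README, rather than its mere presence, is where the AI-assisted parsing mentioned above might come in.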
Because so many repositories lack even the most basic elements, the UW Open Source Program Office has an opportunity to make a significant difference in the quality of current repositories.