University of Wisconsin–Madison

A Look Into GitHub / GitLab Repository Quality Metrics

Abe Megahed, University of Wisconsin Data Science Institute

Summary

As part of the University of Wisconsin–Madison Open Source Program Office's investigation into open source activities around the state, we collected and analyzed metadata from GitHub and GitLab that offers some insight into the general state of code and repository quality.

Methodology

The UW OSPO opened in the fall of 2023, and one of our first tasks was to understand the current state of open source activity at the University of Wisconsin and around the state (in accordance with the Wisconsin Idea). To aid in this task, we built a set of scripts that use the GitHub and GitLab APIs to download and compile metadata about relevant repositories.

  • For the GitHub repositories, we simply looked for repositories that contained the term “Wisconsin” somewhere in the description. This yielded 3033 repositories to examine.
  • For GitLab projects, the University of Wisconsin hosts its own GitLab instance, which students and staff are encouraged to use, so we were able to examine all of the projects on it. This yielded 2579 projects.
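
As a rough illustration of the GitHub side of this search (a minimal sketch, not the OSPO's actual scripts), the query can be expressed with the GitHub REST search endpoint and its `in:description` qualifier. The function names `build_search_url` and `fetch_page` are hypothetical. Note that the search API returns at most 1,000 results per query, so collecting a larger set would require splitting the search into several narrower queries; how that was handled in the original study is not shown here.

```python
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(term: str, page: int = 1, per_page: int = 100) -> str:
    """Build a repository search URL that matches `term` in descriptions only."""
    query = {
        "q": f"{term} in:description",  # qualifier restricting matches to the description
        "per_page": per_page,           # GitHub allows at most 100 results per page
        "page": page,
    }
    return f"{GITHUB_SEARCH_URL}?{urlencode(query)}"


def fetch_page(term: str, page: int = 1) -> list[dict]:
    """Fetch one page of matching repositories (performs a network request)."""
    import json
    from urllib.request import Request, urlopen

    request = Request(
        build_search_url(term, page),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(request) as response:
        # The search response wraps the repositories in an "items" array.
        return json.load(response)["items"]
```

Calling `fetch_page("Wisconsin")` would return the first page of matching repository records. Unauthenticated requests are heavily rate limited, so a real collection script would likely add an access token.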

The code and findings from this investigation are available at the following URL: https://github.com/UW-Madison-DSI/UW-Open-Source-Exploration

Findings

The most important thing to note in the data was actually what was missing from most repositories. These results are summarized in the following charts: https://projects.dsi.wisc.edu/amegahed/open-source

We found the following statistics for GitHub and GitLab:

  • Only 72% of GitHub repositories and 84% of GitLab projects included a README file.
  • Only 14% of GitHub repositories and 10% of GitLab projects included a LICENSE file.
  • Only 5% of repositories included a homepage. More serious projects have a homepage dedicated to actual users of the software rather than just developers.
  • Only 8% of GitHub repositories and 0.3% of GitLab projects included a README containing images. While graphics or illustrations are not strictly required in a README, they are a fairly reliable indicator of whether the authors have considered the needs of their audience.
  • 80% of GitHub repositories and 50% of GitLab projects included a basic description.
  • Only 0.5% of GitHub repositories (and essentially no GitLab projects) included all of these essential elements!

Figure: Prevalence of GitHub Repository Features

Figure: Prevalence of GitLab Project Features
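
Percentages like those listed above can be derived from per-repository metadata in a few lines. The following is a minimal sketch under assumed names: the `RepoRecord` class and its fields are hypothetical and do not reflect the schema used in the OSPO's scripts.

```python
from dataclasses import dataclass


@dataclass
class RepoRecord:
    """Hypothetical per-repository metadata; field names are illustrative."""
    has_readme: bool
    has_license: bool
    has_homepage: bool
    has_readme_images: bool
    has_description: bool


FEATURES = ["has_readme", "has_license", "has_homepage",
            "has_readme_images", "has_description"]


def feature_prevalence(records: list[RepoRecord]) -> dict[str, float]:
    """Return the percentage of records with each feature, and with all of them."""
    total = len(records)
    if total == 0:
        return {}
    # Booleans sum as 0/1, so each sum counts the records that have the feature.
    result = {
        name: 100.0 * sum(getattr(r, name) for r in records) / total
        for name in FEATURES
    }
    result["all_features"] = 100.0 * sum(
        all(getattr(r, name) for name in FEATURES) for r in records
    ) / total
    return result
```

The "all_features" entry corresponds to the last statistic in the list: the share of repositories that include every element at once, which can be far smaller than any individual percentage.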

Proposed Remediations

We have been considering several ways to address the code quality problem and to help people create better repositories.

  1. Educational Materials
    We can create a set of instructional materials that would describe the importance of each of the repository elements and serve as a guide for building quality repositories.
  2. Curated Examples
    Sometimes people learn best by example, so the OSPO is working to build a curated set of showcase repositories that demonstrate what a high-quality, complete repository looks like in practice.
  3. Templates
    There are a number of examples of README templates, but there are differences of opinion on the essential elements and preferred formatting. We could create our own OSPO README template that reflects our own thoughts and experience.
  4. Classes or Workshops
    The OSPO could potentially hold classes or workshops on repository creation. The University of Wisconsin’s Data Science Hub also conducts classes on how to use Git, which could potentially include a section about repository quality.
  5. Automated Repository Evaluation Tools
    We could potentially create an automated evaluation tool that would examine a particular repository and make suggestions for improvement. This might also be an opportunity to explore using AI tools to parse the README text to help with the process of creating meaningful automated suggestions.
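
To make the last idea more concrete, here is a minimal sketch of what a rule-based evaluation tool might look like. It is an illustration only, not an existing OSPO tool: the function names, the file-name conventions it checks, and the wording of the suggestions are all assumptions, and it covers only the README, README image, and LICENSE checks discussed in the findings.

```python
import re
from pathlib import Path
from typing import Optional

# Matches Markdown image syntax ![alt](path) or the start of an HTML <img> tag.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]+\)|<img\s", re.IGNORECASE)


def suggest_improvements(filenames: list[str], readme_text: Optional[str]) -> list[str]:
    """Return suggestions given top-level file names and README text (None if absent)."""
    names = [name.lower() for name in filenames]
    suggestions = []
    if readme_text is None:
        suggestions.append("Add a README file describing the project.")
    elif not IMAGE_PATTERN.search(readme_text):
        suggestions.append("Consider adding a screenshot or diagram to the README.")
    # Assumed naming conventions for license files.
    if not any(n.startswith(("license", "licence", "copying")) for n in names):
        suggestions.append("Add a LICENSE file so others know how they may use the code.")
    return suggestions


def check_directory(repo_dir: str) -> list[str]:
    """Apply the checks to a local checkout of a repository."""
    files = sorted(p for p in Path(repo_dir).iterdir() if p.is_file())
    readme = next((p for p in files if p.name.lower().startswith("readme")), None)
    text = readme.read_text(encoding="utf-8", errors="replace") if readme else None
    return suggest_improvements([p.name for p in files], text)
```

Keeping the rules in a function that takes plain file names and text makes them easy to test and to reuse against API metadata as well as local checkouts. An AI-assisted step that reads the README text, as mentioned above, could be layered on top of these simple checks.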

Conclusions

Because so many repositories lack the most basic and fundamental elements, the UW Open Source Program Office has an opportunity to make a significant difference in the code quality of current repositories.