Good practices with version control systems

2024/06/03

Abstract

This document is intended to capture good practices for working with version control systems.


Table of Contents

1. Writing good commit messages
1.1. Why are good commit messages important?
1.2. What is a good commit message?
1.3. What is the form of a good commit message?
1.4. How to make sure your project has good commit messages
2. Managing Git history
2.1. Maintaining a “clean history”

1. Writing good commit messages

1.1. Why are good commit messages important?

There are actually two aspects to that question: the first is, “why is documenting your changes important?” and the second is, “why is it important to do that in commit messages?”

Documenting your changes is important because a change is made only once, but may need to be studied many times over.

Whenever you make some change to a project, at the time you are making the change, the reason why you are doing it is perfectly obvious to you. After all, you just spent the last couple of hours/days/weeks (depending on how unlucky you were) trying to make that exciting new feature work or fixing that goddamn bug – you know very well what you are doing and why.

Thing is, you are in fact the only person on Earth to know that. Nobody else has done your work implementing that feature or figuring out that bug. What is painfully obvious to you is likely to be a complete mystery to everybody else.

Documenting them in commit messages, specifically, is important because commit messages are the only things that are guaranteed to exist for the lifetime of the project, regardless of where the project is hosted and how it is managed.

Commit messages are stored directly within the version control system (VCS; e.g. Git or Mercurial) that the project is using. Whenever you clone a repository on your local machine, you automatically get all the commits that make up the history of the project (or, at least, the history of the master branch), along with all the associated commit messages. This is true regardless of the repository hosting service (e.g. GitHub, GitLab, or SourceForge) that is used to host the repository. This means that: (1) those commit messages will always exist for as long as even a single copy of the repository exists somewhere; (2) those commit messages are available to you at any time, even if you happen to be working “offline” for any reason (for example, when aboard a train). Therefore, commit messages are the place of choice to record anything that is worth knowing about the changes made to a project.
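As a minimal sketch of that point, assuming Git is the VCS in use and that the `git` executable is installed, the following Python snippet reads every commit message of the current branch straight from a local clone, without any network access. The function names and the choice of separators are illustrative assumptions, not part of any existing tool.

```python
import subprocess

# %H is the commit ID and %B the raw commit message. %x1f and %x1e are
# git pretty-format escapes for the ASCII "unit separator" and "record
# separator" bytes, chosen here because they are unlikely to appear in a
# commit message.
LOG_FORMAT = "%H%x1f%B%x1e"


def parse_log(raw):
    """Turn raw `git log` output in LOG_FORMAT into (commit_id, message) pairs."""
    commits = []
    for record in raw.split("\x1e"):
        record = record.strip("\n")
        if not record:
            continue
        commit_id, _, message = record.partition("\x1f")
        commits.append((commit_id, message.strip()))
    return commits


def local_commit_messages(repo_path="."):
    """Read the commit messages of the current branch from the local clone only."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--format={LOG_FORMAT}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_log(result.stdout)
```

Calling `local_commit_messages("path/to/clone")` (a hypothetical path) only touches the repository on disk, so it works just as well on a train as it does in the office.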

Bug tracker tickets, pull requests, wikis, forums, or any similar features that may be offered by the repository hosting service are, by contrast: (1) only available if you are online; (2) tied to the hosting service. If the project decides to move from one hosting service provider to another, it may be possible to transfer bug tracker tickets or similar items to the new provider, but this is absolutely not guaranteed; it depends entirely on the goodwill of both the provider that the project is leaving and the provider that the project is moving to. Even if the project you are contributing to currently has no intention of ever changing its hosting service provider, it may very well end up not having a choice. Providers come and go, and just because a given provider has existed for a long time does not mean that it will always exist and/or that it will always be willing to host your project (especially if, as in many cases, you do not actually pay for the service). In short, whatever happens in a bug tracker ticket should always be considered volatile – something that could disappear at any time or be unavailable, even temporarily, whenever you need it. A ticket is good for discussing an issue (e.g. pondering the pros and cons of different possible solutions, or ironing out the details of how to implement a requested feature), but it is not a good place to store long-term information.

1.2. What is a good commit message?

A good commit message is a message that explains: (1) what you are trying to do in the commit, and (2) why you are doing it.

Explaining what you are trying to do is important for two reasons. First, it gives whoever is reading your commit a chance to check whether your actual change (as shown by the “diff” of the commit itself) indeed does what it was supposed to do. Basically, the commit message is “here is what I wanted to do”, and the diff is “here is how I did it”. If you don’t explain what the commit is supposed to do, the reader will have to figure that out by reading the diff, but then they will have no way of knowing whether what the commit actually does is really what it was meant to do. This is especially important in the context of a pull request review – when the person who is reading your commit message is the reviewer who will decide whether to approve your PR.

Second, it provides context that may not be available in the diff itself, but that may be useful to understand the overall commit. Keep in mind that the reader may not have the entire codebase in their head (this is true even if the reader happens to be the chief maintainer of the project you are contributing to – for any large enough project, no one can be familiar enough with the entire codebase that they will instantly recognise the lines affected by the diff and immediately know what the change is all about).

Explaining why you are doing what you do is important because that’s something that is typically not inferrable at all from the diff. Are you fixing a bug? What was the bug? Are you improving something? What are you improving? Speed? Memory consumption? Are you implementing a new feature? What feature is it? Is it a feature that was requested specifically by someone? Are you scratching one of your itches?

There is no “correct” length for a commit message. Some changes may require more than 50 lines to be fully explained, while some other changes will need no more than one sentence of 2 or 3 lines. Exercise your judgement, and do not try to be excessively verbose to “fill” a message that otherwise would seem too short to you, or conversely to be excessively concise (losing some precious bits of information in the process) to shorten a message that otherwise would seem too long.

Sometimes, what a commit does and even why it does it may be entirely evident from the diff itself, in which case there may be no need to elaborate in the commit message. Typical examples are documentation changes or typo fixes. In such cases it is OK for a commit message to be composed of only the header line (which may simply be “documentation update” or “typo fix”), without an actual body.

Note that even when a change is evident from the diff, there may still be things worth mentioning in the commit message. For example, let’s say you reorganised an entire chapter of the documentation (moving sections and paragraphs around to make the text flow better): it may be useful to explicitly state that you merely reorganised the contents of the chapter without introducing meaningful changes in the contents (e.g. you didn’t add or remove any section). This will let your fellow contributors know immediately what to expect, without forcing them to scan the entire diff to realise that, “oh, OK, apparently they just moved things around here”.
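For readers less familiar with the terms “header line” and “body”: the header line is the first line of a commit message, and the body is whatever follows it. The short Python sketch below makes that distinction concrete; the function name is hypothetical and the snippet is only meant as an illustration, not as a rule about how messages must be laid out.

```python
def split_commit_message(message):
    """Split a commit message into its header line and its (possibly empty) body."""
    # The header is everything up to the first line break; the rest is the body.
    header, _, body = message.strip().partition("\n")
    return header.strip(), body.strip()
```

With this helper, a message such as “typo fix” yields an empty body, whereas the chapter reorganisation example above would yield a header line plus a one-sentence body stating that sections were only moved around.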

A particular case is when you did some kind of “automated” change – for example, you used a tool to automatically apply some standard formatting rules to an entire file (or maybe even the entire codebase); in that case, it is advised to include in the commit message the precise command used to invoke the tool that performed the change.
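As an illustration only, here is a minimal Python sketch that assembles such a commit message. The helper name `automated_change_message` and the formatter command `my-formatter --in-place src/` are hypothetical placeholders; substitute the tool and the exact arguments you really ran.

```python
import shlex


def automated_change_message(summary, command, note=""):
    """Build a commit message that records the exact command behind an automated change.

    `command` is the argument list that was run (hypothetical example:
    ["my-formatter", "--in-place", "src/"]).
    """
    lines = [
        summary,
        "",
        "This change was generated automatically by running:",
        "",
        # shlex.join quotes the arguments so the line can be pasted back into a shell.
        "    " + shlex.join(command),
    ]
    if note:
        lines += ["", note]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(automated_change_message(
        "Apply standard formatting to the entire codebase",
        ["my-formatter", "--in-place", "src/"],  # hypothetical tool and flags
        note="No manual edits were made on top of the tool's output.",
    ))
```

Recording the command this way lets a reviewer re-run the tool on the parent commit and check that the result matches the diff, instead of reading every reformatted line.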

1.3. What is the form of a good commit message?

Many guidelines about how to write good commit messages focus on the form, with recommendations such as “capitalise the first word”, “use imperative mood”, or “specify the type of commit [with the use of agreed-upon keywords]” and so on.

Such guidelines completely miss the point of what a good commit message is about.

What matters in a commit message is the contents, not the form. A badly formatted message that explains what you did and why you did it will always be better than a perfectly formatted message that does not explain anything useful.

The only case where you should worry about the form is if the project you are contributing to happens to have its own guidelines mandating a specific format for the commit messages – in which case you should obviously try to follow those guidelines as much as possible. Many projects don’t have specific guidelines though, and in their absence you should feel free to write your messages in whatever way you like – just be sure to explain what needs to be explained.

Just use whatever style you want (unless, again, you’re contributing to a project that mandates a given style). You can use a direct, first-person account (“I noticed the variable foo was not properly initialised, and that it could lead to a NullPointerException in some specific conditions. So here I initialise the variable in the constructor of Bar.”). If you are not comfortable using the first person, you can use the passive voice instead (“a proper initialisation procedure was added to ensure that no variables could ever be used without being initialised”). Or you can use the imperative mood (“add a proper initialisation procedure for foo”), or some noun forms (“addition of a proper initialisation procedure”). Or a mix of all that.

For what it's worth, the author of these lines tends to favour the first person plural (“we introduce a new option to allow bla bla bla”), even when he is the only person writing the commit – that is known in some circles as the “royal we”, and is also fine. (He also likes speaking about himself in the third person, apparently.)

Similarly, there is no inherently “good” order for how you explain your commit. You can start by first describing “why” a change was needed (what was the problem in the existing code), and then describe “what” the change is. Or you can do the opposite (“we do bla bla bla. This is because bla bla”). Whatever suits you.

You are allowed to be creative. Feel free to write a commit message as a series of haikus if it would amuse you. As long as you explain what you did and why.

1.4. How to make sure your project has good commit messages

This section is mostly targeted at the maintainers of a project.

2. Managing Git history

2.1. Maintaining a “clean history”

To understand what is meant by “clean history”, let’s imagine the following scenario:

Overall, this is a fairly common story, the likes of which happen every day in the free software world.

At this point, just before being merged, Bob’s PR contains 10 commits:

If Alice merges the PR as it is, all those 10 commits will be added to the tip of the master branch, along with an 11th commit that represents the merge operation itself.

The idea behind the notion of a “clean history” is that the definitive history of the project’s changes (represented by the history of the master branch) should not contain all those commits, since they represent “trial and error” steps that ultimately “pollute” the history. An “idealised” version of the history should only contain:

There is no consensus on whether a project should maintain a clean history, and this guide does not take a firm position.

Proponents of the “clean history” idea posit that the readability of the project’s history is paramount, and that including in it “intermediate” changes that ultimately didn’t make the cut (such as commits #1, #2, and #3 in the example above), changes that were only necessary because of the time it took for the PR to be approved (merge commit #8), or changes that are split over several commits for no good reason (commits #7 and #9 – the second one adds a test that should really have been added already in #7), makes the history needlessly more complicated, and therefore less useful, than it should be.

By contrast, opponents of the “clean history” idea posit that the integrity of the project’s history is paramount, and that the history should merely give an accurate account of what really happened. Software development is messy, trial and error is part of the process, and it is perfectly normal for the project’s history to reflect that. Some even argue that attempts to clean the history are dishonest and are intended to make the project’s developers look smarter than they actually are, by pretending that there is never any “hiccup” in the project’s development.

The two views are fundamentally incompatible, so it’s up to each project to decide, in agreement with its regular contributors, whether it wants to try maintaining a clean history or not.

A couple of objective points, however:

If you decide that your project should have a clean history, there is really only one thing for you to do: you must accept only PRs that contain nothing but the commits you want to see in your history.

That is, when a PR has been through some cycles of reviews, and as a result contains commits that “correct” previous commits in the same PR, you must not merge that PR. Instead, once the reviews are done and the PR has been brought to a state that you are happy with (all corrections implemented, all comments addressed), you must close that PR and ask its author to submit another PR – one that will, from the start, take into account all the comments made on the original PR.

You may have to explain tactfully to the author why submitting another PR is necessary, but usually, based on the experience of projects that have a “clean history policy” in place, the request to submit a clean PR is well received by contributors.