Highlights from Git 2.47 – The GitHub Blog

The open source Git project just released Git 2.47 with features and bug fixes from over 83 contributors, 28 of them new. We last caught up with you on the latest in Git back when 2.46 was released.

To celebrate this most recent release, here is GitHub’s look at some of the most interesting features and changes introduced since last time.

Incremental multi-pack indexes

Returning readers of this series will no doubt remember our coverage of all things related to multi-pack indexes (MIDXs). If you’re new here, or could use a refresher, here’s a brief recap.

Git stores objects (the blobs, trees, commits, and tags that make up your repository’s contents) in one of two formats: either loose or packed. Loose objects are the individual files stored in the two-character sub-directories of $GIT_DIR/objects, each representing a shard of the total set of loose objects. For instance, the object 08103b9f2b6e7fbed517a7e268e4e371d84a9a10 would be stored loose at $GIT_DIR/objects/08/103b9f2b6e7fbed517a7e268e4e371d84a9a10.
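As a quick illustration of that sharding scheme, here is a minimal sketch that splits an object ID into its two-character directory and the remaining file name. The helper name loose_object_path is made up for this example and is not a Git command.

```shell
# loose_object_path is a hypothetical helper (not part of Git): print the path,
# relative to $GIT_DIR, at which the given object ID would be stored loose.
loose_object_path() {
    oid=$1
    dir=$(printf '%s' "$oid" | cut -c1-2)    # first two hex characters: the shard
    file=$(printf '%s' "$oid" | cut -c3-)    # the rest of the ID: the file name
    printf 'objects/%s/%s\n' "$dir" "$file"
}

loose_object_path 08103b9f2b6e7fbed517a7e268e4e371d84a9a10
# prints objects/08/103b9f2b6e7fbed517a7e268e4e371d84a9a10
```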

Objects can also be packed together in a single file known as a packfile. Packfiles store multiple objects together in a binary format, which has a couple of advantages over storing objects loose. Packfiles often have better cache locality because similar objects are often packed next to or near each other. Packfiles also have the advantage of being able to represent objects as deltas of one another, enabling a more compact representation of pairs of similar objects.

However, repositories can start to experience poor performance when they accumulate many packfiles, since Git has to search through each packfile to perform every object lookup. To improve performance when a repository accumulates too many packs, a repository must repack to generate a single new pack which contains the combined contents of all existing packs. This leaves the repository with only a single pack (resulting in faster lookup times), but generating that pack can be expensive.
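If you want to see this on one of your own repositories, the sketch below counts the packs and then consolidates them. The function names count_packs and repack_into_one are invented for this example; the underlying commands are git count-objects -v and git repack -a -d. Note that settings such as pack size limits or .keep files can leave more than one pack behind, so treat this as an approximation of the all-into-one repack described above.

```shell
# count_packs is a hypothetical helper: print how many packfiles the
# repository in the current directory has.
count_packs() {
    git count-objects -v | awk '$1 == "packs:" { print $2 }'
}

# repack_into_one is also hypothetical: combine the contents of all existing
# packs (plus any loose objects) into a single new pack, then delete the
# packs that became redundant.
repack_into_one() {
    git repack -a -d -q
}
```

Running count_packs before and after repack_into_one shows the effect. On a large repository, expect the repack itself to take a while.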

In Git 2.21, multi-pack indexes were introduced to mitigate this expense. A MIDX is an index that maps each object to the pack, and the location within that pack, at which it appears. Because MIDXs can store information about objects across multiple packs, they enable fast object lookups for repositories that have many individual packs, like so:

(Figure: a multi-pack index describing objects stored across multiple packs)

Here the multi-pack index is shown as a series of colored rectangles, each representing an object. The arrows point to those objects’ location within the pack from which they were selected in the MIDX, and encode the information stored in the MIDX itself.

But generating and updating the repository’s MIDX takes time, too: each object in the packs which are part of the MIDX needs to be examined to record its object ID and offset within its source pack. This time can stretch even further if you are using multi-pack reachability bitmaps, since it adds a potentially large number of traversals covering significant portions of the repository to the runtime.

So what is there to do? Repacking your repository to optimize object lookups can be slow, but so can updating your repository’s multi-pack index.

Git 2.47 introduces a new experimental feature known as incremental multi-pack indexes, which allow storing more than one multi-pack index together in a chain of MIDX layers. Each layer contains packs and objects which are distinct from earlier layers, so the MIDX can be updated quickly via an append operation that only takes time proportional to the new objects being added, not the size of the overall MIDX. Here’s an example:

(Figure: incremental MIDX with two layers, describing objects in six unique packs)

The first half of the figure is the same as earlier, but the second half shows a new incremental layer in the multi-pack index chain. The objects contained in the MIDX on the second half are distinct from the ones on the first half. But note that the source packs which appear in the MIDX on the second half have some overlap with the objects which appear in the MIDX on the first half.

In Git 2.47, the incremental multi-pack index feature is still considered experimental, and doesn’t yet support multi-pack reachability bitmaps. But support for incremental multi-pack bitmaps is currently under review and will hopefully appear in a future release.

(At GitHub, we plan to use incremental multi-pack bitmaps as part of further scaling efforts to support even larger repositories during repository maintenance. When we do, expect a blog post from us covering the details.)

You can experiment with incremental multi-pack indexes by running:

$ git multi-pack-index write --incremental

to add new packs to your repository’s existing MIDX today.
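If you’d like to peek at the result, here is a small sketch. The function names write_midx_layer and list_midx_layers are invented for this example. The location of the chain file (objects/pack/multi-pack-index.d/multi-pack-index-chain) is an assumption based on Git’s technical documentation for the feature; since the feature is experimental, the layout may change.

```shell
# write_midx_layer is a hypothetical wrapper around the command shown above.
# It needs Git 2.47 or newer.
write_midx_layer() {
    git multi-pack-index write --incremental
}

# list_midx_layers is also hypothetical: print the contents of the chain file,
# which is assumed to list one MIDX layer per line.
# Assumed location: objects/pack/multi-pack-index.d/multi-pack-index-chain
list_midx_layers() {
    chain=$(git rev-parse --git-path objects/pack/multi-pack-index.d/multi-pack-index-chain)
    if [ -f "$chain" ]; then
        cat "$chain"
    fi
}
```

In a repository without an incremental MIDX, list_midx_layers prints nothing.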

[source]

Quickly find base branches with for-each-ref

Have you ever been working on a branch, or spelunking through a new codebase, and wondered to yourself, “what is this branch based on”? It’s a common question, but it can be surprisingly difficult to answer with the previously existing tools.

A good approximation for determining what branch was the likely starting point for some commit C is to select the branch which minimizes the first-parent commits which are unique to C. (Here, “first parent commits” are the commits which are reachable by only walking through a merge commit’s first parent instead of traversing through all of its parents).

If you’re wondering: “why limit the traversal to the first-parent history?”, the answer is because the first-parent history reflects the main path through history which leads up to a commit. By minimizing the number of unique first-parent commits among a set of candidate base branches, you are essentially searching for the one whose primary development path is closest to commit C. So the branch with the fewest unique first-parent commits is likely where C originated or was branched from.

You might think that you could use something like git rev-list --count --first-parent to count the number of first-parent commits between two endpoints. But that’s not quite the case, since rev-list will remove all commits reachable from the base before returning the unique count.
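To make the heuristic concrete, here is a rough sketch that computes it by hand with tools that predate 2.47. It lists the first-parent history of each side separately and counts the commits found only on C’s side. Both function names are invented for this example, and this is an illustration of the idea rather than how Git implements it.

```shell
# unique_first_parent_count is a hypothetical helper: count the commits that are
# in the first-parent history of commit C ($1) but not in the first-parent
# history of a candidate base ($2).
unique_first_parent_count() {
    c_list=$(mktemp) && base_list=$(mktemp) || return 1
    git rev-list --first-parent "$1" | sort > "$c_list"
    git rev-list --first-parent "$2" | sort > "$base_list"
    # comm -23 keeps only the lines that appear in the first file alone.
    comm -23 "$c_list" "$base_list" | wc -l | tr -d ' '
    rm -f "$c_list" "$base_list"
}

# likely_base is also hypothetical: among all local branches, print the one
# with the smallest count for commit $1. Ties are broken arbitrarily.
likely_base() {
    git for-each-ref --format='%(refname:short)' refs/heads/ |
    while read -r branch; do
        printf '%s %s\n' "$(unique_first_parent_count "$1" "$branch")" "$branch"
    done | sort -n | head -n 1 | cut -d' ' -f2-
}
```

One caveat: if C is itself the tip of a local branch, that branch has a count of zero and will always win, so you would want to leave it out of the candidates.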

Git 2.47 introduces a new tool for figuring out which branch was the likely starting point for some commit via a new atom used in for-each-ref’s --format specification. For example, let’s say I’m trying to figure out which branch name was picked for a topic I worked on upstream.

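Naively searching for the set of branches which contain the thing I’m looking for can return many results, for example if my commit was merged and is now contained in many other branches. But the new %(is-base:<commit-ish>) atom can produce the right answer: in at most one row of the output, it expands to (<commit-ish>), marking the ref that is the most likely starting point.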
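Here is a minimal sketch of how the atom might be used. The wrapper name find_base is invented for this example, and it assumes the commit you pass in is a simple revision such as HEAD or a hash, since the value is pasted directly into the format string.

```shell
# find_base is a hypothetical helper. For the commit given as $1, print the
# local branch that %(is-base:<commit-ish>) marks as the likely starting point.
# It needs Git 2.47 or newer.
find_base() {
    # Each output row is "<branch> <marker>". The marker is "(<commit-ish>)"
    # on at most one row and empty on all the others.
    git for-each-ref --format="%(refname:short) %(is-base:$1)" refs/heads/ |
    awk 'NF > 1 { print $1 }'
}
```

For example, find_base HEAD asks which local branch the current commit most likely started from. If HEAD is itself the tip of a local branch, expect that branch to be the answer.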

[source]


  • Git is famously portable and compatible with a wide variety of systems and architectures, including some fairly exotic ones. But until this most recent release, Git has lacked a formal platform support policy.

This release includes a new “Platform Support Policy” document which outlines Git’s official policy on the matter. The exact details can be found in the source link below, but the current gist is that platforms must have C99 or C11, use versions of dependencies which are stable or have long-term support, and must have an active security support system. Discussions about adding additional requirements, including possibly depending upon Rust in a future version, are ongoing.

The policy also has suggestions for platform maintainers on which branches to test and how to report and fix compatibility issues.

[source]

  • A couple of releases ago, we discussed Git’s preliminary support for a new reference backend known as reftable. If you’re fuzzy on the details, our previous post is chock full of them.

This release brings a number of unit tests which were written in the reftable implementation’s custom testing framework to Git’s standard unit test framework. These migrations were done by Chandra Pratap, one of the Git project’s Google Summer of Code (GSoC) contributors.

This release also saw reftable gain better support when dealing with concurrent writers, particularly during stack compaction. The reftable backend also gained support for git for-each-ref’s --exclude option which we wrote about when Git 2.42 was released.

[source, source, source, source, source, source, source, source, source, source, source, source]

  • While we’re on the topic of unit testing, there were a number of other areas of the project which received more thorough unit test coverage, or migrated over existing tests from Git’s Shell-based integration test suite.

Git’s hashmap API, OID array, and urlmatch normalization features all were converted from Shell-based tests with custom helpers to unit tests. The unit test framework itself also received significant attention, ultimately resulting in using the Clar framework, which was originally written to replace the unit test framework in libgit2.

Many of these unit test conversions were done by Ghanshyam Thakkar, another one of Git’s GSoC contributors. Congratulations, Ghanshyam!

[source, source, source, source, source, source, source]

  • While we’re on the topic of Google Summer of Code contributors, we should mention, last (but not least!), that another student, shejialuo, improved git fsck to check the reference storage backend for integrity in addition to the regular object store. They introduced a new git refs verify sub-command which is run via git fsck and catches many reference corruption issues.

[source]

  • Since at least 2019, there has been an effort to find and annotate unused parameters in functions across Git’s codebase. Annotating parameters as unused can help identify better APIs, and often the presence of an unused parameter can point out a legitimate bug in that function’s implementation.

For many years, the Git project has sought to compile with -Wunused-parameter under its special DEVELOPER=1 mode, making it a compile-time error to have or introduce any unused parameters across the codebase. During that time, there have been many unused parameter cleanups and bug fixes, all done while working around other active development going on in related areas.

In this release, that effort came to a close. Now, when compiling with DEVELOPER=1, it is a compile-time error to have unused parameters, making Git’s codebase cleaner and safer going forward.

[source, source, source, source, source, source, source, source]

  • Way back when Git 2.34 was released, we covered a burgeoning effort to find and fix memory leaks throughout the Git codebase. Back then, we wrote that since Git typically has a very short runtime, it is much less urgent to free memory than it is in, say, library code, since a process’s memory will be “freed” by the operating system when the process stops.

But as Git internals continue to be reshaped with the eventual goal of having them be callable as a first-party library, plugging any memory leaks throughout the codebase is vitally important.

That effort has continued in this release, with more leaks throughout the codebase being plugged. For all of the details, check out the source links below:

[source, source, source, source]

  • The git mergetool command learned a new tool configuration for Visual Studio Code. While it has always been possible to use VSCode’s 3-way merge resolution with Git, doing so required manual configuration.

In Git 2.47, you can now easily configure your repository by running:

$ git config set merge.tool vscode

and subsequent runs of git mergetool will automatically open VSCode in the correct configuration.

[source]

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.47, or any previous version in the Git repository.

Written by

Taylor Blau

Taylor Blau is a Staff Software Engineer at GitHub where he works on Git.
