Highlights from Git 2.50 – The GitHub Blog

The open source Git project just released Git 2.50 with features and bug fixes from 98 contributors, 35 of them new. We last caught up with you on the latest in Git back when 2.49 was released.

💡 Before we get into the details of this latest release, we wanted to remind you that Git Merge, the conference for Git users and developers, is back this year on September 29-30 in San Francisco. Git Merge will feature talks from developers working on Git and in the Git ecosystem. Tickets are on sale now; check out the website to learn more.

With that out of the way, let’s take a look at some of the most interesting features and changes from Git 2.50.

Improvements for multiple cruft packs

When we covered Git 2.43, we talked about newly added support for multiple cruft packs. Git 2.50 improves on that with better command-line ergonomics, and some important bugfixes. In case you’re new to the series, need a refresher, or aren’t familiar with cruft packs, here’s a brief overview:

Git objects may be either reachable or unreachable. The set of reachable objects is everything you can walk to starting from one of your repository’s references: traversing from commits to their parent(s), trees to their sub-tree(s), and so on. Any object that you didn’t visit by repeating that process over all of your references is unreachable.
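To see the distinction in practice, here is a minimal sketch that builds a throwaway repository, writes one blob that no reference points at, and asks git fsck to report it. The repository location, file name, and blob contents are placeholders made up for this illustration.

```shell
# Build a throwaway repository with a single commit.
repo="$(mktemp -d)"
git init -q "$repo"
echo "tracked" >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name=Example -c user.email=example@example.com \
    commit -q -m "initial"

# hash-object -w stores a blob that no commit, tree, or reference points at.
orphan="$(echo "nobody refers to me" | git -C "$repo" hash-object -w --stdin)"

# fsck walks from every reference and reports the objects it never visited.
unreachable="$(git -C "$repo" fsck --unreachable --no-reflogs 2>/dev/null)"
echo "$unreachable"
```

The blob committed as file.txt is reachable through the commit's tree, so only the object written by hash-object should show up in the report.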

In Git 2.37, Git introduced cruft packs, a new way to store your repository’s unreachable objects. A cruft pack looks like an ordinary packfile with the addition of an .mtimes file, which is used to keep track of when each object was most recently written in order to determine when it is safe to discard it.

However, updating the cruft pack could be cumbersome–particularly in repositories with many unreachable objects–since a repository’s cruft pack must be rewritten in order to add new objects. Git 2.43 began to address this through a new command-line option: git repack --max-cruft-size. This option was designed to split unreachable objects across multiple packs, each no larger than the value specified by --max-cruft-size. But there were a couple of problems:

  • If you’re familiar with git repack’s --max-pack-size option, --max-cruft-size’s behavior is quite confusing. The former option specifies the maximum size an individual pack can be, while the latter involves how and when to move objects between multiple packs.
  • The feature was broken to begin with! Since --max-cruft-size also imposes on cruft packs the same pack-size constraints as --max-pack-size does on non-cruft packs, it is often impossible to get the behavior you want.

For example, suppose you had two 100 MiB cruft packs and ran git repack --max-cruft-size=200M. You might expect Git to merge them into a single 200 MiB pack. But since --max-cruft-size also dictates the maximum size of the output pack, Git will refuse to combine them, or worse: rewrite the same pack repeatedly.

Git 2.50 addresses both of these issues with a new option: --combine-cruft-below-size. Instead of specifying the maximum size of the output pack, it determines which existing cruft pack(s) are eligible to be combined. This is particularly helpful for repositories that have accumulated many unreachable objects spread across multiple cruft packs. With this new option, you can gradually reduce the number of cruft packs in your repository over time by combining existing ones together.

With the introduction of --combine-cruft-below-size, Git 2.50 repurposed --max-cruft-size to behave as a cruft pack-specific override for --max-pack-size. Now --max-cruft-size only determines the size of the outgoing pack, not which packs get combined into it.
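As a rough sketch of how the two repack invocations might look, the following creates a scratch repository with one unreachable object, writes a cruft pack, and then asks Git to fold small cruft packs together. The 10M threshold and the scratch repository are arbitrary choices for this illustration, not recommendations.

```shell
# Scratch repository with one commit and one unreachable blob.
repo="$(mktemp -d)"
git init -q "$repo"
echo "tracked" >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name=Example -c user.email=example@example.com \
    commit -q -m "initial"
echo "unreferenced" | git -C "$repo" hash-object -w --stdin >/dev/null

# Pack reachable objects normally and unreachable ones into a cruft pack.
git -C "$repo" repack --cruft -d -q ||
    echo "cruft packs need Git 2.37 or newer" >&2

# Git 2.50+: combine existing cruft packs smaller than 10 MiB into the new
# cruft pack. The threshold is an assumption made for this sketch.
git -C "$repo" repack --cruft -d -q --combine-cruft-below-size=10M ||
    echo "this Git does not support --combine-cruft-below-size" >&2

# Every cruft pack is accompanied by an .mtimes file, so count those.
cruft_packs="$(find "$repo/.git/objects/pack" -name '*.mtimes' | wc -l | tr -d ' ')"
echo "cruft packs: $cruft_packs"
```

In a large repository you would likely pick the threshold based on how big you are willing to let a single rewrite get, and run this periodically so the number of cruft packs shrinks over time.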

Along the way, a bug was uncovered that prevented objects stored in multiple cruft packs from being “freshened” in certain circumstances. In other words, some unreachable objects don’t have their modification times updated when they are rewritten, leading to them being removed from the repository earlier than they otherwise would have been. Git 2.50 squashes this bug, meaning that you can now efficiently manage multiple cruft packs and freshen their objects to your heart’s content.

[source, source]

Incremental multi-pack reachability bitmaps

Back in our coverage of Git 2.47, we talked about preliminary support for incremental multi-pack indexes. Multi-pack indexes (MIDXs) act like a single pack *.idx file for objects spread across multiple packs.

Multi-pack indexes are extremely useful for accelerating object lookups in large repositories: Git can binary search through a single index containing most of your repository’s contents, rather than repeatedly searching through each individual packfile. But multi-pack indexes aren’t just useful for accelerating object lookups. They’re also the basis for multi-pack reachability bitmaps, the MIDX-specific analogue of classic single-pack reachability bitmaps.

If neither of those is familiar to you, don’t worry; here’s a brief refresher. Single-pack reachability bitmaps store a collection of bitmaps corresponding to a selection of commits. Each bit position in a pack bitmap refers to one object in that pack. In each individual commit’s bitmap, the set bits correspond to objects that are reachable from that commit, and the unset bits represent those that are not.

Multi-pack bitmaps were introduced to take advantage of the substantial performance increase afforded to us by reachability bitmaps. Instead of having bitmaps whose bit positions correspond to the set of objects in a single pack, a multi-pack bitmap’s bit positions correspond to the set of objects in a multi-pack index, which may include objects from arbitrarily many individual packs. If you’re curious to learn more about how multi-pack bitmaps work, you can read our earlier post Scaling monorepo maintenance.

However, like cruft packs above, multi-pack indexes can be cumbersome to update as your repository grows larger, since each update requires rewriting the entire multi-pack index and its corresponding bitmap, regardless of how many objects or packs are being added. In Git 2.47, the file format for multi-pack indexes became incremental, allowing multiple MIDX layers to be stacked on top of one another to form a chain of MIDXs. This made it much easier to add objects to your repository’s MIDX, but the incremental MIDX format at the time did not yet have support for multi-pack bitmaps.

Git 2.50 brings support for the multi-pack reachability bitmap format to incremental MIDX chains, with each MIDX layer having its own *.bitmap file. These bitmap layers can be used in conjunction with one another to provide reachability information about selected commits at any layer of the MIDX chain. In effect, this allows extremely large repositories to quickly and efficiently add new reachability bitmaps as new commits are pushed to the repository, regardless of how large the repository is.
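If you want to poke at this yourself, the sketch below writes a single incremental MIDX layer with a bitmap in a scratch repository and lists the resulting files. It assumes a Git new enough to accept --incremental together with --bitmap; the scratch repository is made up for the illustration, and the on-disk layout may still change given the feature's experimental status.

```shell
# Scratch repository with one commit, packed into a single packfile.
repo="$(mktemp -d)"
git init -q "$repo"
echo "tracked" >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name=Example -c user.email=example@example.com \
    commit -q -m "initial"
git -C "$repo" repack -d -q

# Git 2.50+: add a MIDX layer and write a bitmap for that layer.
git -C "$repo" multi-pack-index write --incremental --bitmap ||
    echo "this Git cannot combine --incremental with --bitmap" >&2

# Layers are stored under multi-pack-index.d/, alongside a chain file that
# records their order.
midx_dir="$repo/.git/objects/pack/multi-pack-index.d"
ls "$midx_dir" 2>/dev/null
```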

This feature is still considered highly experimental, and support for repacking objects into incremental multi-pack indexes and bitmaps is still fairly bare-bones. This is an active area of development, so we’ll make sure to cover any notable developments to incremental multi-pack reachability bitmaps in this series in the future.

[source]

The ORT merge engine replaces recursive

This release also saw some exciting updates related to merging. Way back when Git 2.33 was released, we talked about a new merge engine called “ORT” (standing for “Ostensibly Recursive’s Twin”).

ORT is a from-scratch rewrite of Git’s old merging engine, called “recursive.” ORT is significantly faster, more maintainable, and has many new features that were difficult to implement on top of its predecessor.

One of those features is the ability for Git to determine whether or not two things are mergeable without actually persisting any new objects necessary to construct the merge in the repository. Previously, the only way to tell whether two things are mergeable was to run git merge-tree --write-tree on them. That works, but merge-tree writes any new objects generated by the merge into the repository. Over time, these can accumulate and cause performance issues. In Git 2.50, you can make the same determination without writing any new objects by using merge-tree’s new --quiet mode and relying on its exit code.
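Here is a small sketch of what relying on the exit code could look like. It sets up two branches that change the same line differently, then checks mergeability with --quiet. The branch names left and right and the commit helper are hypothetical, invented for this example.

```shell
repo="$(mktemp -d)"
git init -q "$repo"

# Hypothetical helper: overwrite file.txt with the given text and commit it.
commit() {
    echo "$1" >"$repo/file.txt"
    git -C "$repo" add file.txt
    git -C "$repo" -c user.name=Example -c user.email=example@example.com \
        commit -q -m "$1"
}

# Two branches that diverge from a common base and edit the same line.
commit "base"
git -C "$repo" checkout -q -b left
commit "left change"
git -C "$repo" checkout -q -b right left~1
commit "right change"

# Git 2.50+: exit status 0 means a clean merge, 1 means conflicts.
git -C "$repo" merge-tree --write-tree --quiet left right &&
    status=0 || status=$?

case "$status" in
0) echo "left and right merge cleanly" ;;
1) echo "left and right conflict" ;;
*) echo "merge-tree failed (is this Git older than 2.50?)" >&2 ;;
esac
```

Any exit status other than 0 or 1 indicates that merge-tree itself failed, so a script should treat that case separately rather than as "not mergeable".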

Most exciting in this release is that ORT has entirely superseded recursive, and recursive is no longer part of Git’s source code. When ORT was first introduced, it was only accessible through git merge’s -s option to select a strategy. In Git 2.34, ORT became the default choice over recursive, though the latter was still available in case there were bugs or behavior differences between the two. Now, 16 versions and three and a half years later, recursive has been completely removed from Git, with its author, Elijah Newren, writing:

As a wise man once told me, “Deleted code is debugged code!”

As of Git 2.50, recursive has been completely “debugged” (that is, deleted). For more about ORT’s internals and its development, check out this five-part series from Elijah here, here, here, here, and here.

[source, source, source]


  • If you’ve ever scripted around your repository’s objects, you are likely familiar with git cat-file, Git’s purpose-built tool to list objects and print their contents. git cat-file has many modes, like --batch (for printing out the contents of objects), or --batch-check (for printing out certain information about objects without printing their contents).

    Oftentimes it is useful to dump the set of all objects of a certain type in your repository. For commits, git rev-list can easily enumerate a set of commits. But what about, say, trees? In the past, to filter down to just the tree objects from a list of objects, you might have written something like:

    $ git cat-file --batch-check='%(objecttype) %(objectname)' \
        --buffer <in | perl -ne 'print "$1\n" if /^tree ([0-9a-f]+)/'

    Git 2.50 brings Git’s object filtering mechanism used in partial clones to git cat-file, so the above can be rewritten a little more concisely like:

    $ git cat-file --batch-check='%(objectname)' --filter="object:type=tree" <in
    

    [source]

  • While we’re on the topic, let’s discuss a little-known git cat-file command-line option: --allow-unknown-type. This arcane option was used with objects that have a type other than blob, tree, commit, or tag. This is a quirk dating back a little more than a decade that allows git hash-object to write objects with arbitrary types. In the time since, this feature has gotten very little use. In fact, git cat-file -p --allow-unknown-type can’t even print out the contents of one of these objects!
    $ oid="$(git hash-object -w -t notatype --literally /dev/null)"
    $ git cat-file -p $oid
    fatal: invalid object type
    

    This release makes the --allow-unknown-type option silently do nothing, and removes support from git hash-object to write objects with unknown types in the first place.

    [source]

  • The git maintenance command learned a number of new tricks this release as well. It can now perform a few new kinds of tasks: worktree-prune, rerere-gc, and reflog-expire. worktree-prune mirrors git gc’s functionality to remove stale or broken Git worktrees. rerere-gc also mirrors existing functionality exposed via git gc to expire old rerere entries from previously recorded merge conflict resolutions. Finally, reflog-expire can be used to remove stale entries from the reflog.

    git maintenance also ships with new configuration for the existing loose-objects task. This task removes lingering loose objects that have since been packed away, and then makes new pack(s) for any loose objects that remain. The size of those packs was previously fixed at a maximum of 50,000 objects, and can now be configured with the maintenance.loose-objects.batchSize configuration option.

    [source, source, source]

  • If you’ve ever needed to recover some work you lost, you may be familiar with Git’s reflog feature, which allows you to track changes to a reference over time. For example, you can go back and revisit earlier versions of your repository’s main branch by doing git show main@{2} (to show main prior to the two most recent updates) or main@{1.week.ago} (to show where your copy of the branch was a week ago).

    Reflog entries can accumulate over time, and you can reach for git reflog expire in the event you need to clean them up. But how do you delete the entirety of a branch’s reflog? If you’re not yet running Git 2.50 and thought “surely it’s git reflog delete”, you’d be wrong! Prior to Git 2.50, the only way to delete a branch’s entire reflog was to do git reflog expire $BRANCH --expire=all.

    In Git 2.50, a new drop sub-command was introduced, so you can accomplish the same as above with the much more natural git reflog drop $BRANCH.

    [source]

  • Speaking of references, Git 2.50 also received some attention to how references are processed and used throughout its codebase. When using the low-level git update-ref command, Git used to spend time checking whether or not the proposed refname could also be a valid object ID, making its lookups ambiguous. Since update-ref is such a low-level command, this check is no longer done, delivering some performance benefits to higher-level commands that rely on update-ref for their functionality.

    Git 2.50 also learned how to cache whether or not any prefix of a proposed reference name already exists (for example, you can’t create a reference refs/heads/foo/bar/baz if either refs/heads/foo/bar or refs/heads/foo already exists).

    Finally, in order to make those checks, Git used to create a new reference iterator for each individual prefix. Git 2.50’s reference backends learned how to “seek” existing iterators, saving time by being able to reuse the same iterator when checking each possible prefix.

    [source]

  • If you’ve ever had to tinker with Git’s low-level curl configuration, you may be familiar with Git’s configuration options for tuning HTTP connections, like http.lowSpeedLimit and http.lowSpeedTime, which are used to terminate an HTTP connection that is transferring data too slowly.

    These options can be useful when fine-tuning Git to work in complex networking environments. But what if you want to tweak Git’s TCP Keepalive behavior? This can be useful to control when and how often to send keepalive probes, as well as how many to send, before terminating a connection that hasn’t sent data recently.

    Prior to Git 2.50, this wasn’t possible, but this version introduces three new configuration options: http.keepAliveIdle, http.keepAliveInterval, and http.keepAliveCount, which can be used to control the fine-grained behavior of curl’s TCP probing (provided your operating system supports it).

    [source]

  • Git is famously portable and runs on a wide variety of operating systems and environments with very few dependencies. Over the years, various parts of Git have been written in Perl, including some commands like the original implementation of git add -i. These days, very few remaining Git commands are written in Perl.

    This version reduces Git’s usage of Perl by removing it as a dependency of the test suite and documentation toolchain. Many Perl one-liners from Git’s test suite were rewritten to use shell functions or builtins, and some were rewritten as tiny C programs. Tests in the handful of remaining cases that still have a hard dependency on Perl will be skipped on systems that don’t have a working Perl.

    [source, source]

  • This release also shipped a minor cosmetic update to git rebase -i. When starting a rebase, your $EDITOR might appear with contents that look something like:

    pick c108101daa foo
    pick d2a0730acf bar
    pick e5291f9321 baz
    

    You can edit that list to break, reword, or exec (among many others), and Git will happily execute your rebase. But if you change the commit messages in your rebase’s TODO script, the commits’ actual messages won’t change!

    That’s because the commit messages shown in the TODO script are just meant to help you identify which commits you’re rebasing. (If you want to rewrite any commit messages along the way, you can use the reword command instead). To clarify that these messages are cosmetic, Git will now prefix them with a # comment character like so:

    pick c108101daa # foo
    pick d2a0730acf # bar
    pick e5291f9321 # baz
    

    [source]

  • Longtime readers of this series will recall our coverage of Git’s bundle feature (when Git added support for partial bundles), though we haven’t covered Git’s bundle-uri feature. Git bundles are a way to package your repository’s contents, both its objects and the references that point at them, into a single *.bundle file.

    While Git has had support for bundles since as early as v1.5.1 (nearly 18 years ago!), its bundle-uri feature is much newer. In short, the bundle-uri feature allows a server to serve part of a clone by first directing the client to download a *.bundle file. After the client does so, it will try to perform a fill-in fetch to gather any missing data advertised by the server but not part of the bundle.

    To speed up this fill-in fetch, your Git client will advertise any references that it picked up from the *.bundle itself. But in previous versions of Git, this could sometimes result in slower clones overall! That’s because up until Git 2.50, Git would only advertise the branches in refs/heads/* when asking the server to send the remaining set of objects.

    Git 2.50 now advertises all references it knows about from the *.bundle when doing a fill-in fetch from the server, making bundle-uri-enabled clones much faster.

    For more details about these changes, you can check out this blog post from Scott Chacon.

    [source]

  • Last but not least, git add -p (and git add -i) now work much more smoothly in sparse checkouts by no longer having to expand the sparse index. This follows in a long line of work that has been gradually adding sparse-index compatibility to Git commands that interact with the index.

    Now you can interactively stage parts of your changes before committing in a sparse checkout without having to wait for Git to populate the sparsified parts of your repository’s index. Give it a whirl on your local sparse checkout today!

    [source]
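
To make the git maintenance item from the list above a little more concrete, here is a rough sketch that tunes the loose-objects batch size and then runs each of the new tasks by name. The value 10000 and the scratch repository are assumptions for the illustration.

```shell
repo="$(mktemp -d)"
git init -q "$repo"

# Cap packs written by the loose-objects task at 10,000 objects instead of
# the previous fixed limit of 50,000. The number is arbitrary.
git -C "$repo" config maintenance.loose-objects.batchSize 10000
git -C "$repo" maintenance run --task=loose-objects

# The three tasks added in Git 2.50, run one at a time.
for task in worktree-prune rerere-gc reflog-expire; do
    git -C "$repo" maintenance run --task="$task" ||
        echo "task $task is not available in this Git" >&2
done
```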

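For the reflog item in the list above, this sketch drops a branch's entire reflog, falling back to the older git reflog expire spelling when the new sub-command is unavailable. The scratch repository and its two commits exist only to give the branch some reflog entries.

```shell
repo="$(mktemp -d)"
git init -q "$repo"
for message in first second; do
    echo "$message" >"$repo/file.txt"
    git -C "$repo" add file.txt
    git -C "$repo" -c user.name=Example -c user.email=example@example.com \
        commit -q -m "$message"
done
branch="$(git -C "$repo" symbolic-ref --short HEAD)"

# Git 2.50+: remove the branch's whole reflog in one step. On older versions
# this fails, so fall back to the equivalent expire invocation.
git -C "$repo" reflog drop "$branch" 2>/dev/null ||
    git -C "$repo" reflog expire --expire=all "$branch"

# Count the reflog entries that are left for the branch.
remaining="$(git -C "$repo" reflog show "$branch" 2>/dev/null | wc -l | tr -d ' ')"
echo "entries left for $branch: $remaining"
```

Note that only the branch's own reflog is affected here; the reflog for HEAD is a separate log and keeps its entries.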

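And for the TCP keepalive item in the list above, a configuration along these lines might be a starting point. The numbers are placeholders chosen for the sketch rather than tuned recommendations, and whether they take effect depends on your operating system and curl build.

```shell
repo="$(mktemp -d)"
git init -q "$repo"

# Wait 60 seconds on an idle connection before sending the first probe,
# send a probe every 10 seconds after that, and give up after 5 probes go
# unanswered. All three values are assumptions for this example.
git -C "$repo" config http.keepAliveIdle 60
git -C "$repo" config http.keepAliveInterval 10
git -C "$repo" config http.keepAliveCount 5

# Show what was written to the repository's configuration.
git -C "$repo" config --get-regexp '^http\.keepalive'
```
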
The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.50, or any previous version in the Git repository.

🎉 Git turned 20 this year! Celebrate by watching our interview of Linus Torvalds, where we discuss how it forever changed software development.

Written by

Taylor Blau is a Staff Software Engineer at GitHub where he works on Git.


