git-filter-branch is glacially slow; its design makes it
impossible for a backward-compatible implementation to ever be
fast:
• In editing files, git-filter-branch by design checks out each
and every commit as it existed in the original repo. If your
repo has 10^5 files and 10^5 commits, but each commit only
modifies five files, then git-filter-branch will make you do
10^10 modifications, despite only having (at most) 5*10^5
unique blobs.
• If you try to cheat and make git-filter-branch only work on
files modified in a commit, then two things happen:
• you run into problems with deletions whenever the user is
simply trying to rename files (because attempting to
delete files that don't exist looks like a no-op; it
takes some chicanery to remap deletes across file renames
when the renames happen via arbitrary user-provided
shell commands)
• even if you succeed at the map-deletes-for-renames
chicanery, you still technically violate backward
compatibility because users are allowed to filter files
in ways that depend upon topology of commits instead of
filtering solely based on file contents or names (though
this has not been observed in the wild).
• Even if you don't need to edit files but only want to e.g.
rename or remove some and thus can avoid checking out each
file (i.e. you can use --index-filter), you are still passing
shell snippets for your filters. This means that for every
commit, you have to have a prepared git repo where those
filters can be run. That's a significant setup.
• Further, several additional files are created or updated per
commit by git-filter-branch. Some of these are for supporting
the convenience functions provided by git-filter-branch (such
as map()), while others are for keeping track of internal
state (but could have also been accessed by user filters; one
of git-filter-branch's regression tests does so). This
essentially amounts to using the filesystem as an IPC
mechanism between git-filter-branch and the user-provided
filters. Disks tend to be a slow IPC mechanism, and writing
these files also effectively represents a forced
synchronization point between separate processes that we hit
with every commit.
• The user-provided shell commands will likely involve a
pipeline of commands, resulting in the creation of many
processes per commit. Creating and running another process
takes a widely varying amount of time between operating
systems, but on any platform it is very slow relative to
invoking a function.
• git-filter-branch itself is written in shell, which is kind
of slow. This is the one performance issue that could be
backward-compatibly fixed, but compared to the above problems
that are intrinsic to the design of git-filter-branch, the
language of the tool itself is a relatively minor issue.
• Side note: Unfortunately, people tend to fixate on the
written-in-shell aspect and periodically ask if
git-filter-branch could be rewritten in another language
to fix the performance issues. Not only does that ignore
the bigger intrinsic problems with the design, it'd help
less than you'd expect: if git-filter-branch itself were
not shell, then the convenience functions (map(),
skip_commit(), etc.) and the --setup argument could no
longer be executed once at the beginning of the program
but would instead need to be prepended to every user
filter (and thus re-executed with every commit).
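The arithmetic in the first bullet above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are illustrative and not part of git-filter-branch, and the figures are the hypothetical repository sizes quoted in that bullet, not measurements.

```python
# Hypothetical repository from the text: 10^5 files, 10^5 commits,
# and five files modified by each commit.
files_in_repo = 10**5
commits = 10**5
files_changed_per_commit = 5

# Checking out every commit in full means touching every file of
# every commit.
files_visited = files_in_repo * commits

# Upper bound on distinct blobs, as stated in the text: each commit
# contributes at most five new file versions.
unique_blobs = files_changed_per_commit * commits

print(files_visited)                  # 10000000000
print(unique_blobs)                   # 500000
print(files_visited // unique_blobs)  # 20000
```

Under these assumptions, a full checkout per commit touches roughly 20,000 times as many files as there are distinct blobs to process.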
The git filter-repo [1] tool is an alternative to
git-filter-branch which does not suffer from these performance
problems or the safety problems (mentioned below). For those with
existing tooling which relies upon git-filter-branch, git
filter-repo also provides filter-lamely [2], a drop-in
git-filter-branch replacement (with a few caveats). While
filter-lamely suffers from all the same safety issues as
git-filter-branch, it at least ameliorates the performance issues
a little.