git-filter-branch is riddled with gotchas resulting in various
ways to easily corrupt repos or end up with a mess worse than
what you started with:
• Someone can have a set of "working and tested filters" which
they document or provide to a coworker, who then runs them on
a different OS where the same commands are not working/tested
(some examples in the git-filter-branch manpage are also
affected by this). BSD vs. GNU userland differences can
really bite. If lucky, error messages are spewed. But just as
likely, the commands either don't do the filtering requested,
or silently corrupt the history by making some unwanted change. The
unwanted change may only affect a few commits, so it's not
necessarily obvious either. (The fact that problems won't
necessarily be obvious means they are likely to go unnoticed
until the rewritten history is in use for quite a while, at
which point it's really hard to justify another flag-day for
another rewrite.)
• Filenames with spaces are often mishandled by shell snippets
since they cause problems for shell pipelines. Not everyone
is familiar with find -print0, xargs -0, git-ls-files -z,
etc. Even people who are familiar with these may assume such
flags are not relevant because someone else renamed any such
files in their repo back before the person doing the
filtering joined the project. And often, even those familiar
with handling arguments with spaces may not do so just
because they aren't in the mindset of thinking about
everything that could possibly go wrong.
• Non-ascii filenames can be silently removed despite being in
a desired directory. Keeping only wanted paths is often done
using pipelines like git ls-files | grep -v ^WANTED_DIR/ |
xargs git rm. ls-files will only quote filenames if needed,
so folks may not notice that one of the files didn't match
the regex (at least not until it's much too late). Yes,
someone who knows about core.quotePath can avoid this (unless
they have other special characters like \t, \n, or "), and
people who use ls-files -z with something other than grep can
avoid this, but that doesn't mean they will.
• Similarly, when moving files around, one can find that
filenames with non-ascii or special characters end up in a
different directory, one that includes a double quote
character. (This is technically the same issue as above with
quoting, but perhaps an interesting different way that it can
and has manifested as a problem.)
• It's far too easy to accidentally mix up old and new history.
It's still possible with any tool, but git-filter-branch
almost invites it. If lucky, the only downside is users
getting frustrated that they don't know how to shrink their
repo and remove the old stuff. If unlucky, they merge old and
new history and end up with multiple "copies" of each commit,
some of which have unwanted or sensitive files and others
which don't. This comes about in multiple different ways:
    • the default of doing only a partial history rewrite
    (--all is not the default and few examples show it)
    • the fact that there's no automatic post-run cleanup
    • the fact that --tag-name-filter (when used to rename
    tags) doesn't remove the old tags but just adds new ones
    with the new name
    • the fact that little educational information is provided
    to inform users of the ramifications of a rewrite and how
    to avoid mixing old and new history. For example, this
    man page discusses how users need to understand that they
    need to rebase their changes for all their branches on
    top of new history (or delete and reclone), but that's
    only one of multiple concerns to consider. See the
    "DISCUSSION" section of the git filter-repo manual page
    for more details.
• Annotated tags can be accidentally converted to lightweight
tags, due to either of two issues:
    • Someone can do a history rewrite, realize they messed up,
    restore from the backups in refs/original/, and then redo
    their git-filter-branch command. (The backup in
    refs/original/ is not a real backup; it dereferences tags
    first.)
    • Running git-filter-branch with either --tags or --all in
    your <rev-list options>. In order to retain annotated
    tags as annotated, you must use --tag-name-filter (and
    must not have restored from refs/original/ in a
    previously botched rewrite).
• Any commit messages that specify an encoding will become
corrupted by the rewrite; git-filter-branch ignores the
encoding, takes the original bytes, and feeds them to
commit-tree without telling it the proper encoding. (This
happens whether or not --msg-filter is used.)
• Commit messages (even if they are all UTF-8) by default
become corrupted because they are not updated: any references to
other commit hashes in commit messages will now refer to
no-longer-extant commits.
• There are no facilities for helping users find what unwanted
crud they should delete, which means they are much more
likely to have incomplete or partial cleanups that sometimes
result in confusion and people wasting time trying to
understand. (For example, folks tend to just look for big
files to delete instead of big directories or extensions, and
once they do so, then sometime later folks using the new
repository who are going through history will notice a build
artifact directory that has some files but not others, or a
cache of dependencies (node_modules or similar) which
couldn't have ever been functional since it's missing some
files.)
• If --prune-empty isn't specified, then the filtering process
can create hordes of confusing empty commits.
• If --prune-empty is specified, then intentionally placed
empty commits from before the filtering operation are also
pruned instead of just pruning commits that became empty due
to filtering rules.
• If --prune-empty is specified, sometimes empty commits are
missed and left around anyway (a somewhat rare bug, but it
happens...)
• A minor issue, but users who have a goal to update all names
and emails in a repository may be led to --env-filter, which
will only update authors and committers, missing taggers.
• If the user provides a --tag-name-filter that maps multiple
tags to the same name, no warning or error is provided;
git-filter-branch simply overwrites each tag in some
undocumented pre-defined order resulting in only one tag at
the end. (A git-filter-branch regression test requires this
surprising behavior.)
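
As a rough illustration of the filename hazards listed above (spaces,
non-ascii names, quoting), the following sketch selects the paths that
lie outside a wanted directory from NUL-delimited input, so that
filenames pass through byte-for-byte and are never quoted or split.
The helper name paths_outside_dir and the sample paths are
hypothetical, and the sketch assumes an xargs that supports -0 (the
GNU and BSD versions do).

```shell
# Hypothetical helper (not part of git): read NUL-delimited paths on
# stdin and write back, NUL-delimited, those NOT under the directory "$1".
paths_outside_dir() {
    # "wanted" is exported only into the environment of xargs and the
    # sh it spawns; the paths themselves arrive as separate arguments.
    wanted=$1 xargs -0 sh -c '
        for p do
            case $p in
                "$wanted"/*) ;;              # inside the wanted dir: keep
                *) printf "%s\0" "$p" ;;     # outside: emit for removal
            esac
        done' sh
}

# Demo with made-up paths; tr is used only to make the result readable.
printf 'WANTED_DIR/a b.txt\0other/c d.txt\0top.txt\0' |
    paths_outside_dir WANTED_DIR | tr '\0' '\n'
```

In a scratch clone, the input would typically come from
git ls-files -z, with the output fed to a NUL-aware consumer such as
xargs -0 git rm --cached -q --. Treat that pairing as an assumption
to verify on your own platform and repository, not as a vetted recipe.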
Also, the poor performance of git-filter-branch often leads to
safety issues:
• Coming up with the correct shell snippet to do the filtering
you want is sometimes difficult unless you're just doing a
trivial modification such as deleting a couple files.
Unfortunately, people often learn if the snippet is right or
wrong by trying it out, but the rightness or wrongness can
vary depending on special circumstances (spaces in filenames,
non-ascii filenames, funny author names or emails, invalid
timezones, presence of grafts or replace objects, etc.),
meaning they may have to wait a long time, hit an error, then
restart. The performance of git-filter-branch is so bad that
this cycle is painful, reducing the time available to
carefully re-check (to say nothing about what it does to the
patience of the person doing the rewrite even if they do
technically have more time available). This problem is further
compounded because errors from broken filters may not be
shown for a long time and/or get lost in a sea of output.
Even worse, broken filters often just result in silently
incorrect rewrites.
• To top it all off, even when users finally find working
commands, they naturally want to share them. But they may be
unaware that their repo didn't have some special cases that
someone else's does. So, when someone else with a different
repository runs the same commands, they get hit by the
problems above. Or, the user just runs commands that really
were vetted for special cases, but they run them on a different
OS where it doesn't work, as noted above.
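
One concrete instance of the OS differences mentioned above is
sed -i: GNU sed takes an optional suffix attached directly to the
option, while BSD sed requires a separate suffix argument, so a
filter using it that works on one system can fail on the other. A
minimal sketch that sidesteps the option by going through a temporary
file follows. The helper name portable_sed_inplace and the
temporary-file naming scheme are hypothetical.

```shell
# Hypothetical helper: apply the sed expression "$1" to the file "$2"
# without relying on the non-portable -i option.
portable_sed_inplace() {
    tmp="$2.tmp.$$"                     # assumed-unused scratch name
    # Write the edited copy first; overwrite the original only if sed
    # succeeded. Using cat (not mv) keeps the file's mode and inode.
    if sed -e "$1" <"$2" >"$tmp" && cat "$tmp" >"$2"; then
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example call (the path contains a space on purpose):
#   portable_sed_inplace 's/old@/new@/' "docs/AUTHORS list.txt"
```

Quoting "$2" everywhere is what lets this cope with spaces in the
path. It does not make the sed expression itself portable, so the
expression still needs checking against both sed implementations.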