Путеводитель по Руководству Linux

  User  |  Syst  |  Libr  |  Device  |  Files  |  Other  |  Admin  |  Head  |



   git-filter-branch    ( 1 )

переписать ветки (Rewrite branches)

SAFETY

git-filter-branch is riddled with gotchas resulting in various
       ways to easily corrupt repos or end up with a mess worse than
       what you started with:

• Someone can have a set of "working and tested filters" which they document or provide to a coworker, who then runs them on a different OS where the same commands are not working/tested (some examples in the git-filter-branch manpage are also affected by this). BSD vs. GNU userland differences can really bite. If lucky, error messages are spewed. But just as likely, the commands either don't do the filtering requested, or silently corrupt by making some unwanted change. The unwanted change may only affect a few commits, so it's not necessarily obvious either. (The fact that problems won't necessarily be obvious means they are likely to go unnoticed until the rewritten history is in use for quite a while, at which point it's really hard to justify another flag-day for another rewrite.)

• Filenames with spaces are often mishandled by shell snippets since they cause problems for shell pipelines. Not everyone is familiar with find -print0, xargs -0, git-ls-files -z, etc. Even people who are familiar with these may assume such flags are not relevant because someone else renamed any such files in their repo back before the person doing the filtering joined the project. And often, even those familiar with handling arguments with spaces may not do so just because they aren't in the mindset of thinking about everything that could possibly go wrong.

• Non-ascii filenames can be silently removed despite being in a desired directory. Keeping only wanted paths is often done using pipelines like git ls-files | grep -v ^WANTED_DIR/ | xargs git rm. ls-files will only quote filenames if needed, so folks may not notice that one of the files didn't match the regex (at least not until it's much too late). Yes, someone who knows about core.quotePath can avoid this (unless they have other special characters like \t, \n, or "), and people who use ls-files -z with something other than grep can avoid this, but that doesn't mean they will.

• Similarly, when moving files around, one can find that filenames with non-ascii or special characters end up in a different directory, one that includes a double quote character. (This is technically the same issue as above with quoting, but perhaps an interesting different way that it can and has manifested as a problem.)

• It's far too easy to accidentally mix up old and new history. It's still possible with any tool, but git-filter-branch almost invites it. If lucky, the only downside is users getting frustrated that they don't know how to shrink their repo and remove the old stuff. If unlucky, they merge old and new history and end up with multiple "copies" of each commit, some of which have unwanted or sensitive files and others which don't. This comes about in multiple different ways:

• the default to only doing a partial history rewrite (--all is not the default and few examples show it)

• the fact that there's no automatic post-run cleanup

• the fact that --tag-name-filter (when used to rename tags) doesn't remove the old tags but just adds new ones with the new name

• the fact that little educational information is provided to inform users of the ramifications of a rewrite and how to avoid mixing old and new history. For example, this man page discusses how users need to understand that they need to rebase their changes for all their branches on top of new history (or delete and reclone), but that's only one of multiple concerns to consider. See the "DISCUSSION" section of the git filter-repo manual page for more details.

• Annotated tags can be accidentally converted to lightweight tags, due to either of two issues:

• Someone can do a history rewrite, realize they messed up, restore from the backups in refs/original/, and then redo their git-filter-branch command. (The backup in refs/original/ is not a real backup; it dereferences tags first.)

• Running git-filter-branch with either --tags or --all in your <rev-list options>. In order to retain annotated tags as annotated, you must use --tag-name-filter (and must not have restored from refs/original/ in a previously botched rewrite).

• Any commit messages that specify an encoding will become corrupted by the rewrite; git-filter-branch ignores the encoding, takes the original bytes, and feeds it to commit-tree without telling it the proper encoding. (This happens whether or not --msg-filter is used.)

• Commit messages (even if they are all UTF-8) by default become corrupted due to not being updated — any references to other commit hashes in commit messages will now refer to no-longer-extant commits.

• There are no facilities for helping users find what unwanted crud they should delete, which means they are much more likely to have incomplete or partial cleanups that sometimes result in confusion and people wasting time trying to understand. (For example, folks tend to just look for big files to delete instead of big directories or extensions, and once they do so, then sometime later folks using the new repository who are going through history will notice a build artifact directory that has some files but not others, or a cache of dependencies (node_modules or similar) which couldn't have ever been functional since it's missing some files.)

• If --prune-empty isn't specified, then the filtering process can create hoards of confusing empty commits

• If --prune-empty is specified, then intentionally placed empty commits from before the filtering operation are also pruned instead of just pruning commits that became empty due to filtering rules.

• If --prune-empty is specified, sometimes empty commits are missed and left around anyway (a somewhat rare bug, but it happens...)

• A minor issue, but users who have a goal to update all names and emails in a repository may be led to --env-filter which will only update authors and committers, missing taggers.

• If the user provides a --tag-name-filter that maps multiple tags to the same name, no warning or error is provided; git-filter-branch simply overwrites each tag in some undocumented pre-defined order resulting in only one tag at the end. (A git-filter-branch regression test requires this surprising behavior.)

Also, the poor performance of git-filter-branch often leads to safety issues:

• Coming up with the correct shell snippet to do the filtering you want is sometimes difficult unless you're just doing a trivial modification such as deleting a couple files. Unfortunately, people often learn if the snippet is right or wrong by trying it out, but the rightness or wrongness can vary depending on special circumstances (spaces in filenames, non-ascii filenames, funny author names or emails, invalid timezones, presence of grafts or replace objects, etc.), meaning they may have to wait a long time, hit an error, then restart. The performance of git-filter-branch is so bad that this cycle is painful, reducing the time available to carefully re-check (to say nothing about what it does to the patience of the person doing the rewrite even if they do technically have more time available). This problem is extra compounded because errors from broken filters may not be shown for a long time and/or get lost in a sea of output. Even worse, broken filters often just result in silent incorrect rewrites.

• To top it all off, even when users finally find working commands, they naturally want to share them. But they may be unaware that their repo didn't have some special cases that someone else's does. So, when someone else with a different repository runs the same commands, they get hit by the problems above. Or, the user just runs commands that really were vetted for special cases, but they run it on a different OS where it doesn't work, as noted above.