Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs

Zygo Blaxell Wed, 03 Mar 2010 08:07:28 -0800

On Wed, Mar 03, 2010 at 09:12:04AM -0600, Jonathan Nieder wrote:
> Zygo Blaxell wrote:
> > If the index is then
> > committed, the corrupt blob cannot be checked out (or is checked out
> > with incorrect contents, depending on which tool you use to try to get
> > the file out of git) in the future.
> 
> Do you remember more details?  Some commands might be worth changing
> to be more permissive (to more easily recover from corruption) or
> strict (to more easily catch it).


I think whether you get no object vs. corrupt object depends on whether
the object is loose or packed.

Different git versions behave slightly differently, though I can't get
a corrupt data case at all at the moment.

git checkout gives you a working tree where corrupt files are missing,
and an index where corrupt files are marked deleted.

git filter-branch aborts when it sees the corrupt data if you have a
tree-filter, but if you only have an index-filter it will ignore
corrupt objects unless you do something to force it to examine their
contents.

filter-branch index-filter won't help you if other objects have been
deltaified based on corrupt objects--at that point, recovery is very hard.
I've only seen that occur on pack files that were corrupted outside of
git, though, so it's not a Git problem.

> > Surprisingly, it's possible to clone, fetch, push, pull, and sometimes
> > even gc the corrupted repo several times before anyone notices the
> > corruption.
> 
> For push, I do not think it is worth the slowdown; better to spend the
> time needed to be robust in hash-object.
> 
> On the other hand, it does not seem overly burdensome to me to make
> 'git gc' imply a 'git fsck --no-full'.  That wouldn't be needed to
> catch on-disk corruption (the CRC32 already does that), but it could
> be a good idea anyway.  Cloning the bug.

git gc will notice the corruption if it's packing corrupt loose objects.
It fails to notice if it's not packing loose objects, e.g. because
the loose objects are not old enough.
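
Since gc may skip recent loose objects, an explicit fsck is the more
reliable check.  A rough sketch of that check follows.  The throwaway
repository, the file names, and the way the mismatch is simulated
(copying one loose object over another) are all made up for illustration;
a real racy add produces the mismatch on its own.

```shell
#!/bin/sh
# Sketch: simulate a blob whose stored contents do not match its name,
# then show that an explicit fsck reports it. All names here are invented.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

echo "contents that were hashed" > a.txt
echo "contents that were stored" > b.txt
good=$(git hash-object -w a.txt)
other=$(git hash-object -w b.txt)

# Path of a loose object: .git/objects/<first 2 hex digits>/<rest>
loose() { echo ".git/objects/$(echo "$1" | cut -c1-2)/$(echo "$1" | cut -c3-)"; }

# Overwrite the first object with the second one's data. The file still
# inflates cleanly, but its contents no longer hash to its name.
rm -f "$(loose "$good")"
cp "$(loose "$other")" "$(loose "$good")"

# fsck exits non-zero when it finds a problem; record which way it went.
if git fsck >fsck.log 2>&1; then
    fsck_status=clean
else
    fsck_status=corrupt
fi
echo "fsck says: $fsck_status"
```

The details of what fsck printed are left in fsck.log; the wording of the
error differs between Git versions.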

push and fetch cases are covered by the receive.fsckObjects config
variable.  It defaults to off, but if it's turned on then it will catch
the corruption at push/fetch time.  Obviously this doesn't help if you
have a corrupt repo and you never push/fetch it anywhere, and it also
doesn't help you if you don't change the setting from the default.
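
For reference, turning the setting on looks roughly like this.  The bare
repository below is a throwaway created only for the example; on a real
server the config command would be run in the repository that receives
pushes.  Later Git releases also document fetch.fsckObjects and
transfer.fsckObjects, which are not covered here.

```shell
#!/bin/sh
# Sketch: enable object checking on a repository that receives pushes.
# The repository path is a throwaway created here for illustration.
srv=$(mktemp -d)
git init -q --bare "$srv"

# Verify every object that arrives via push before accepting it.
git -C "$srv" config receive.fsckObjects true

# Read the setting back to confirm it is in effect.
setting=$(git -C "$srv" config --get receive.fsckObjects)
echo "receive.fsckObjects=$setting"
```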

> > If the affected commit is included in a merge with history
> > from other git users, the only way to fix it is to rebase
> 
> I assume you used rebase -f?  Clever.

I reset to one commit before the corruption, then manually extract the
surviving changes between the commit after the corruption and the next
commit that modifies the corrupted file.  After that, usually all later
commits can be cherry-picked.  Once the result is pushed, everyone *else*
working on the project has to rebase their changes.

> If it does not fail,
> 
>  git filter-branch --tree-filter :
> 
> should accomplish the same thing without making the history linear,
> though it does not scale well to deep histories.  It should be
> possible to devise an appropriate index-filter for this, too, if that
> is too slow.

The problem isn't speed--the problem is tree-filter's requirement to check
out the data.  It can't, because the data is corrupt.  filter-branch does
check in that case, and it should (otherwise a filesystem on unreliable
media could spray undetected junk into your repo).
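
To illustrate the difference, here is a minimal sketch of an index-filter
that never checks anything out.  It simply drops one path from every
commit, which is a cruder fix than the manual procedure described earlier.
The repository, the file name "bad.dat" (standing in for the file with
the corrupt blob), and the commit identity are assumptions made for the
example.

```shell
#!/bin/sh
# Sketch: rewrite history with an index-filter that drops one path.
# "bad.dat" stands in for the file with the corrupt blob; the repository,
# file names and commit identity are all invented for this example.
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer Git prints a long warning otherwise
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

echo keep > good.txt
echo junk > bad.dat
git add good.txt bad.dat
git commit -q -m "first"
echo more >> good.txt
git commit -q -a -m "second"

# The index-filter only edits the index of each commit, so no blob has to be
# checked out; a tree-filter would need to write every file to disk.
git filter-branch -f --index-filter \
    'git rm -q --cached --ignore-unmatch bad.dat' HEAD >/dev/null 2>&1

# Count the commits on the rewritten branch that still touch the path.
remaining=$(git log --format=%H -- bad.dat | wc -l | tr -d ' ')
echo "commits still touching bad.dat: $remaining"
```

filter-branch keeps the old history under refs/original, so nothing is
lost until those refs are deleted.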

> > (or come up
> > with a blob whose contents match the affected SHA1 hash somehow).
> 
> Yes, depending on how the file was changing this could be easy or hard
> to do.  /usr/share/doc/git-doc/howto/recover-corrupted-blob-object.txt
> is about something like this.

It's usually hard when the file was in some transient state during the
SHA1 calculation.  ;)
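
When plausible candidates for the lost contents do exist (old backups,
editor copies), checking them against the expected blob name can be
sketched as below.  The candidate files and the "wanted" hash are
generated inside the example; in a real recovery the hash would come
from the tree of the affected commit.

```shell
#!/bin/sh
# Sketch: test candidate file versions against the blob name a tree expects.
# The candidates and the wanted hash are invented for illustration.
work=$(mktemp -d)
cd "$work" || exit 1
git init -q .

printf 'version 1\n' > cand1
printf 'version 2\n' > cand2
printf 'version 3\n' > cand3
wanted=$(git hash-object cand2)     # pretend this is the missing blob's name

match=none
for f in cand1 cand2 cand3; do
    # hash-object without -w only computes the name; nothing is written yet.
    if [ "$(git hash-object "$f")" = "$wanted" ]; then
        git hash-object -w "$f" >/dev/null   # store the recovered blob
        match=$f
        break
    fi
done
echo "matching candidate: $match"
```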

> "Something like git that can handle snapshots" tends to be something
> people want every once in a while.  Of course, git has some problems
> for these uses:
> 
>  - racy add, as you noticed;

Only Git seems to have that.  SVN and CVS didn't.  Or maybe they did,
but they lacked the internal integrity-checking mechanisms to detect it.

>  - checkout is not atomic or close to atomic;

Not a problem in my use cases.  Checkouts are very rare, usually only
occurring after some disaster or other.

>  - large files are not supported well (but there is some work going on
>    to change this);

"Large" is relative to the size of the system doing the work.  15 years
ago, 1MB was a "large" file; today, 1MB is on the high end of "small."

>  - uncompressible files are not supported well;

Much better than CVS.

>  - file metadata is not tracked;

Usually not a problem.

>  - directories with many entries are not supported well;

ext3 doesn't support that use case well either.  Fortunately it rarely
comes up.

>  - rename detection works poorly with binary files;

Still better than CVS or SVN.

>  - no quick way to throw away old history.

I don't intend to throw away old history at all.  Some of my snapshot
repos go back more than 10 years (they were converted from CVS to
Subversion, and then to Git).

> If the tree to be tracked is small (like /etc), one can try to work
> around these things and get away with it for a while.
> 
> Still, when so many of git's features are the opposite of useful for
> the goal, it is a wonder it is so tempting to try anyway.  My guess is
> it is the UI for browsing and manipulating history.

Compression, integrity checking, and replication are the big wins for me.
The compression advantage of Git vs. other tools is not trivial.  Git
outperforms Subversion by something like 200:1.

The UI is important too, after a fashion--I have a bunch of tools
that notify me when they detect changes in some arbitrary git repo,
and it's easy to add snapshot repos to that notification framework.
It's usually more convenient to add the repos to the framework than to
add the framework to the repos, since that usually requires adding a
pile of extra tools on small or isolated machines.  Also, doing it
that way means the notification framework holds a backup copy of
the snapshot repo.



