On Mon, Feb 08, 2016 at 11:55:37PM +0000, Anatoly Borodin wrote:

> unfortunately, `--tree-filter true` doesn't solve the problem with the
> repo at my work: not all old blobs are replaced with the new ones. I've
> made a test repository to demonstrate it; it's a huge one (115M), but I
> couldn't make it much smaller, because the bug fails to reproduce if the
> repo is not big enough:
> 
> https://github.com/anatolyborodin/git-filter-branch-bug
> 
> There are some description and instructions in `README.md`. I hope you
> will be able to reproduce it on your machine, if not - just add more
> files :)
> 
> I've debugged the test case and found the place where `git diff-index`
> behaves differently regarding `aa/bb.dat`:
> 
> read-cache.c +351     ie_match_stat():
> ...
>       if (!changed && is_racy_timestamp(istate, ce)) {
>               if (assume_racy_is_modified)
>                       changed |= DATA_CHANGED;
>               else
>                       changed |= ce_modified_check_fs(ce, st);
>       }
> ...
> 
> After `git-checkout-index` the blob hash for `aa/bb.dat` in the index is
> 88a0f09b9b2e4ccf2faec89ab37d416fba4ee79d (the huge binary), but the file
> on disk is a text file "This file was to big, and it has been removed."
> with the hash 16e0939430610600301680d5bf8a24a22ff8b6c4.

Right, that makes sense. The index doesn't know anything about the
replace mechanism. So it assumes that what it wrote matches what is in
the stat cache (i.e., some sha1 and the matching stat info). Later, when
git wants to know "should I bother reading this file back in and
computing its sha1", the stat cache says no.

And then as you noticed, it sometimes "works" because if the file and
index timestamps are the same, we err on the side of re-reading. So more
files means more likelihood of crossing the one-second boundary.

> I don't know if it should be considered to be a bug in in the logic of
> `git checkout-index`, or `git diff-index` / `git update-index`.

I'd say if any, it is the fault of checkout-index for writing out
content that does not match the sha1 in the index, but writing out the
stat information as if it is clean. For your case, you'd obviously
prefer that the file be marked dirty and re-read later.

But I'm not sure whether other users of replace refs would want the same
behavior. They may want to silently pretend as if the data is
unmodified. I'm not sure if anyone is doing that in practice, though; as
you noted, the results are not deterministic, so it probably ends up
just being a huge pain. So perhaps it would make sense to make the
checkout operation aware of replaced blobs.

As a workaround for your filter-branch, I suspect you could do
`--tree-filter 'git ls-files | xargs touch'` or something, but that is
going to be rather inefficient.

By now you've probably realized that replaced blobs are not widely used,
and there are a lot of corner cases around checking them out. So let's
take a step back for a moment. What are you trying to accomplish with
your filter-branch? If you just want to replace blob A with blob B, I
think it might be easier to skip refs/replace entirely and just do so
explicitly in an index-filter, like:

  --index-filter '
    git ls-files -s |
    sed "s/$old_sha1/$new_sha1/" |
    git update-index --index-info
  '

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to