On Thu, May 09, 2013 at 09:27:28AM +0300, Elazar Leibovich wrote:
> On Wed, May 8, 2013 at 11:11 PM, Tzafrir Cohen <tzaf...@cohens.org.il> wrote:
> 
> >
> > Git stores files. It should handle such deduping by design. But this
> > is in Git's storage, and not in the actual filesystem:
> >
> 
> git packs them in a pack file.
> 
> Use git gc to make it aware of changes, or just look at my reply to Oleg.

(It's really a side-issue, as it won't help you, but, xkcd.com/385.
Warning: another long post)

No. That's pure file-level de-duplication.

Following my previous run:

tzafrir@pungenday:/tmp/git-test(master)$ du -s .git .
1188    .git
2052    .
tzafrir@pungenday:/tmp/git-test(master)$ git ls-tree HEAD
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand1
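
The identical IDs above follow from how git names a blob: the ID is the
sha1 of a short header ("blob", the size in bytes, a NUL byte) followed by
the content, so the file name plays no part in it. A minimal sketch that
recomputes the ID by hand, assuming GNU coreutils (sha1sum, head, wc) and a
repository that uses sha1 object names. The helper name blob_id and the
temporary files are made up for illustration:

```shell
# Recompute a git blob ID by hand: sha1("blob <size>\0" + content).
# blob_id is a hypothetical helper, not a git command.
blob_id() {
    size=$(wc -c < "$1")
    # $((size)) strips the padding some wc implementations add
    { printf 'blob %d\0' "$((size))"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

tmp=$(mktemp -d)
head -c 4096 /dev/urandom > "$tmp/rand"
cp "$tmp/rand" "$tmp/rand1"      # same content, different name

id_rand=$(blob_id "$tmp/rand")
id_rand1=$(blob_id "$tmp/rand1")
echo "$id_rand  rand"
echo "$id_rand1  rand1"
```

Both lines should show the same ID. If git is at hand, 'git hash-object'
on the same file should print the same value.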

Now let's see how it handles gzipped files. Note that you need to add
~2M to the size of '.' in the following, as gzip has replaced 'rand' and
'rand1' with their .gz versions.

When we use gzip with -n, we get exactly the same file:

tzafrir@pungenday:/tmp/git-test(master)$ gzip -n rand
tzafrir@pungenday:/tmp/git-test(master)$ gzip -n rand1
tzafrir@pungenday:/tmp/git-test(master)$ git add rand*.gz
tzafrir@pungenday:/tmp/git-test(master)$ git commit -m gzipped
[master 443ded4] gzipped
 2 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 rand.gz
 create mode 100644 rand1.gz
tzafrir@pungenday:/tmp/git-test(master)$ du -s .git .
2236    .git
2060    .
tzafrir@pungenday:/tmp/git-test(master)$ git ls-tree HEAD
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand.gz
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand1
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand1.gz

Without -n, we get different files:

tzafrir@pungenday:/tmp/git-test(master)$ git show HEAD:rand > rand2
tzafrir@pungenday:/tmp/git-test(master)$ git show HEAD:rand > rand3
tzafrir@pungenday:/tmp/git-test(master)$ gzip rand2
tzafrir@pungenday:/tmp/git-test(master)$ gzip rand3
tzafrir@pungenday:/tmp/git-test(master)$ md5sum rand2.gz rand3.gz 
603d95587520d3ca203329eaeea8ac6c  rand2.gz
572f7178846083f82bb56da1e996d9a1  rand3.gz
tzafrir@pungenday:/tmp/git-test(master)$ git add rand[23].gz
tzafrir@pungenday:/tmp/git-test(master)$ git commit -m "gzipped without -n"
tzafrir@pungenday:/tmp/git-test(master)$ du -s .git .
4316    .git
4116    .
tzafrir@pungenday:/tmp/git-test(master)$ git ls-tree HEAD
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand.gz
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand1
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand1.gz
100644 blob 4e2e7106fad7c2d38dfed8b0221686ac561709ac    rand2.gz
100644 blob f2f9615fb255d7e75d74e65a675ca850a3c67a34    rand3.gz

rand.gz and rand1.gz have the same content (or rather: the same sha1
checksum of the content) and thus are stored as the same blob. Without
'-n', gzip records the original file name and timestamp in the header, so
the results differ slightly (only in the header; the rest of the file is
the same).
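
To see how small that difference is, the outputs can be compared byte by
byte. A sketch, assuming GNU gzip and cmp are available; the file names a
and b are arbitrary stand-ins for rand2 and rand3:

```shell
tmp=$(mktemp -d)
head -c 65536 /dev/urandom > "$tmp/a"
cp "$tmp/a" "$tmp/b"             # same content, different name

# With -n: no name or timestamp is stored, so the outputs are identical.
gzip -n -c "$tmp/a" > "$tmp/a-n.gz"
gzip -n -c "$tmp/b" > "$tmp/b-n.gz"

# Without -n: the header records each file's own name and mtime.
gzip -c "$tmp/a" > "$tmp/a.gz"
gzip -c "$tmp/b" > "$tmp/b.gz"

if cmp -s "$tmp/a-n.gz" "$tmp/b-n.gz"; then same_with_n=yes; else same_with_n=no; fi
# cmp -l prints one line per differing byte
diff_bytes=$(cmp -l "$tmp/a.gz" "$tmp/b.gz" | wc -l)

echo "identical with -n: $same_with_n"
echo "differing bytes without -n: $((diff_bytes))"
```

The count without -n should be a handful of bytes at most: one for the
name, plus up to four for the timestamp if the two files were not created
within the same second.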

That's before any packing. When git packs it, it can compress the three
gzip-compressions of rand to almost the size of a single one:

tzafrir@pungenday:/tmp/git-test(master)$ git gc
Counting objects: 12, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (12/12), done.
Total 12 (delta 4), reused 0 (delta 0)
tzafrir@pungenday:/tmp/git-test(master)$ du -s .git .
1164    .git
4116    .
tzafrir@pungenday:/tmp/git-test(master)$ git ls-tree HEAD
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand.gz
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand1
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand1.gz
100644 blob 4e2e7106fad7c2d38dfed8b0221686ac561709ac    rand2.gz
100644 blob f2f9615fb255d7e75d74e65a675ca850a3c67a34    rand3.gz

And indeed:

tzafrir@pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
wc -c
3146274
tzafrir@pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
gzip | wc -c
3146772

Hmm... not quite as expected. We need a larger dictionary (gzip only looks
back 32k, bzip2 works in blocks of at most 900k, and each file here is
about 1M):

tzafrir@pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
gzip -9 | wc -c
3146772
tzafrir@pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
bzip2 | wc -c
3160302
tzafrir@pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
bzip2 -9 | wc -c
3160302
tzafrir@pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
xz | wc -c
1049516

Finally. So as I was saying, those three compress very well together and
thus git can efficiently pack them. The size also suggests that the
compression did not change the content of the file very much in this
pathological case of a file with close-to-random content:

tzafrir@pungenday:/tmp/git-test(master)$ cat rand rand1 rand.gz rand1.gz rand2.gz rand3.gz | xz | wc -c
1050152
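
The dictionary-size effect can be reproduced with gzip alone. Its format
can only refer back 32k, so a repeated block is noticed only if the repeat
starts within that distance. A rough sketch, assuming gzip and
/dev/urandom; the sizes are picked just for illustration:

```shell
tmp=$(mktemp -d)
head -c 16384   /dev/urandom > "$tmp/small"   # 16k: a repeat is within reach
head -c 1048576 /dev/urandom > "$tmp/big"     # 1M: a repeat is far out of reach

# Compress two back-to-back copies of each block and count the output bytes.
small_out=$(cat "$tmp/small" "$tmp/small" | gzip -c | wc -c)
big_out=$(cat "$tmp/big" "$tmp/big" | gzip -c | wc -c)

echo "2 x 16k -> $((small_out)) bytes"
echo "2 x 1M  -> $((big_out)) bytes"
```

The first pair should come out at little more than one copy, the second at
about the full 2M. Swapping 'gzip -c' for 'xz -c' in the second pipeline
should bring it down to roughly one copy as well, as in the runs above.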

-- 
Tzafrir Cohen         | tzaf...@jabber.org | VIM is
http://tzafrir.org.il |                    | a Mutt's
tzaf...@cohens.org.il |                    |  best
tzaf...@debian.org    |                    | friend
