On Sun, Jan  9 at 22:54, Peter Taps wrote:
> Thank you all for your help. I am the OP.
>
> I haven't looked at the link that talks about the probability of
> collision. Intuitively, I still wonder how the chances of collision
> can be so low. We are reducing a 4K block to just 256 bits. If the
> chances of collision are so low, *theoretically* it is possible to
> reconstruct the original block from the 256-bit signature by using a
> simple lookup. Essentially, we would now have the world's best
> compression algorithm, irrespective of whether the data is text or
> binary. This is hard to digest.
>
> Peter

"simple" lookup isn't so simple when there are 2^256 records to
search, however, fundamentally your understanding of hashes is
correct.
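
For what it's worth, the "world's best compression" intuition fails on
a plain counting argument rather than on the collision odds: there are
vastly more distinct 4 KB blocks than 256-bit digests, so a
digest-to-block lookup table cannot be well defined. The little Python
sketch below just works out those counts; the 4 KB block size and
256-bit digest width are the only inputs, and nothing in it is
ZFS-specific.

# Counting argument: a 256-bit digest cannot serve as a reversible
# "compressed" form of a 4 KB block, because many distinct blocks
# necessarily share every digest value (pigeonhole principle).

BLOCK_BITS = 4096 * 8   # bits in one 4 KB block (32768)
HASH_BITS = 256         # bits in the digest (e.g. SHA-256)

# Printing exponents is more readable than printing 10,000-digit numbers.
print(f"distinct 4 KB blocks   : 2^{BLOCK_BITS}")
print(f"distinct 256-bit hashes: 2^{HASH_BITS}")

# On average each digest value is shared by 2^(32768 - 256) = 2^32512
# different blocks, so a digest-to-block "lookup" is not well defined.
print(f"blocks per digest value: 2^{BLOCK_BITS - HASH_BITS}")

Dedup gets away with this only because the blocks actually present in
any real pool are a vanishingly small sample of that space, which is
where the collision-probability math comes in.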

That being said, while at some point people might identify two
commonly used blocks with the same hash (e.g. system library files or
the like), the odds of it happening are extremely low.  A random
Google-result website calculates that you would need ~45 exabytes of
4 KB-chunk deduped data in your pool before the chance of a hash
collision reaches ~10^-17:

http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html
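
For reference, the standard birthday-bound approximation behind
numbers like that is p ~= n(n-1)/2 / 2^b for n unique blocks and a
b-bit hash. I don't know which hash width that calculator assumes, so
the 160-bit case in the quick Python sketch below is only a guess at
where a ~10^-17 figure could come from; plug the same 45 EB pool into
a 256-bit digest and the estimate drops by nearly thirty orders of
magnitude. Rough arithmetic only, not ZFS's actual dedup code path:

# Birthday-bound approximation for the probability of at least one
# collision among n unique blocks hashed with a b-bit hash:
#     p ~= n * (n - 1) / 2 / 2^b      (accurate while p is small)

def collision_probability(n_blocks: int, hash_bits: int) -> float:
    """Approximate chance of any two blocks sharing a digest."""
    return n_blocks * (n_blocks - 1) / 2 / 2.0 ** hash_bits

POOL_BYTES = 45 * 10 ** 18      # ~45 exabytes, as in the linked article
BLOCK_SIZE = 4096               # 4 KB dedup chunks
n = POOL_BYTES // BLOCK_SIZE    # roughly 1.1e16 unique blocks

# Hash widths are illustrative: 160-bit (SHA-1-sized) vs 256-bit (SHA-256).
for bits in (160, 256):
    p = collision_probability(n, bits)
    print(f"{bits}-bit hash over 45 EB of 4 KB blocks: p ~= {p:.1e}")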

Now, obviously the above is in the context of having to restore from
backup, which is rare; in live usage, however, I don't think the math
changes a whole lot.



--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
