On Thu, January 6, 2011 14:44, Peter Taps wrote:
> I have been told that the checksum value returned by Sha256 is almost
> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
> bigger problem such as memory corruption, etc. Essentially, adding
> verification to sha256 is an overkill.
>
> Perhaps (Sha256+NoVerification) would work 99.999999% of the time. But
> (Fletcher+Verification) would work 100% of the time.
>
> Which one of the two is a better deduplication strategy?

The ZFS default is what you should be using unless you can explain
(technically, and preferably mathematically) why you should use something
else.

I'm guessing you're using "99.999999%" as a 'literary gesture', and
haven't done the math. Taken literally, it means you have a 0.000001%
(10^-8) chance of having a collision.

The reality is that the odds are actually 10^-77 (~ 2^-256; see [1] though):

    http://blogs.sun.com/bonwick/entry/zfs_dedup

As a form of comparison, the odds of having a non-recoverable bit error
from a hard disk are about 10^-15 (per bit read) for SAS disks and 10^-14
for SATA disks. So a disk read error is about sixty orders of magnitude
more likely than a collision from SHA-256.
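
To put numbers on that comparison, here is a rough Python sketch. It
assumes SHA-256 behaves like an ideal random function and takes the
per-bit disk error rates as datasheet-style figures, not measurements; the
variable and function names are made up for illustration. It also compares
a per-bit rate with a per-pair rate, so treat the result as an order of
magnitude only.

```python
import math

# Chance that two specific blocks share a SHA-256 digest, assuming the hash
# behaves like an ideal random function (an idealization).
P_HASH = 2.0 ** -256

# Unrecoverable-read-error rates per bit read (assumed datasheet-style
# figures, not measurements).
P_SAS = 1e-15
P_SATA = 1e-14


def orders_of_magnitude(p_more_likely: float, p_less_likely: float) -> float:
    """Return how many powers of ten separate two probabilities."""
    return math.log10(p_more_likely) - math.log10(p_less_likely)


hash_exponent = math.log10(P_HASH)
sas_gap = orders_of_magnitude(P_SAS, P_HASH)
sata_gap = orders_of_magnitude(P_SATA, P_HASH)

print(f"SHA-256 pair collision: ~10^{hash_exponent:.0f}")
print(f"SAS read error is ~10^{sas_gap:.0f} times more likely")
print(f"SATA read error is ~10^{sata_gap:.0f} times more likely")
```

The gap comes out at a little over sixty powers of ten for either disk
type.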

If you're not worried about disk read errors (and/or are not experiencing
them), then you shouldn't be worried about hash collisions.

TL;DR: do a "dedup=on" and forget about it.

Some more discussion as it relates to some backup dedupe appliances (the
principles are the same):

http://tinyurl.com/36369pb
http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html


[1] It may actually be 10^-38 (2^-128) or so because of the birthday
paradox, but we're still talking unlikely. You have a better chance of
dying from lightning or being attacked by a mountain lion:

http://www.blog.joelx.com/odds-chances-of-dying/877/
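
One way to read the birthday effect: with n unique blocks in a pool there
are n(n-1)/2 pairs, so the chance that any two collide is at most about
n(n-1)/2 * 2^-256, and it only approaches one half somewhere near 2^128
blocks. A small Python sketch of that bound follows, assuming an ideal
256-bit hash; the pool size (2^35 unique 128 KiB blocks, i.e. 4 PiB) is a
hypothetical example, not a figure from this thread.

```python
import math
from fractions import Fraction

HASH_BITS = 256


def collision_probability(n_blocks: int, bits: int = HASH_BITS) -> Fraction:
    """Birthday (union) bound: number of block pairs times 2^-bits.

    This is an upper bound, and a good approximation while it is tiny.
    """
    pairs = n_blocks * (n_blocks - 1) // 2
    return Fraction(pairs, 2 ** bits)


# Hypothetical pool: 2**35 unique blocks of 128 KiB each (4 PiB of data).
n = 2 ** 35
p = collision_probability(n)

# Take logs of numerator and denominator separately; p itself is far too
# small to convert to a float without care.
log10_p = math.log10(p.numerator) - math.log10(p.denominator)
print(f"any collision among 2^35 blocks: ~10^{log10_p:.0f}")
```

Even for a pool that size the bound stays dozens of orders of magnitude
below the disk error rates quoted above.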

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
