On Thu, January 6, 2011 14:44, Peter Taps wrote:
> I have been told that the checksum value returned by Sha256 is almost
> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
> bigger problem such as memory corruption, etc. Essentially, adding
> verification to sha256 is an overkill.
>
> Perhaps (Sha256+NoVerification) would work 99.999999% of the time. But
> (Fletcher+Verification) would work 100% of the time.
>
> Which one of the two is a better deduplication strategy?
The ZFS default is what you should be using unless you can explain
(technically, and preferably mathematically) why you should use something
else.

I'm guessing you're using "99.999999%" as a 'literary gesture', and haven't
done the math. The above means that you have a 0.000001% or 10^-8 chance of
having a collision. The reality is that the odds are actually about 10^-77
(~ 2^-256; see [1] though):

    http://blogs.sun.com/bonwick/entry/zfs_dedup

As a form of comparison, the odds of having a non-recoverable bit error from
a hard disk are about 10^-15 for SAS disks and 10^-14 for SATA disks. So
you're about sixty orders of magnitude more likely to have a disk read error
than to get a collision from SHA-256.

If you're not worried about disk read errors (and/or are not experiencing
them), then you shouldn't be worried about hash collisions.

TL;DR: do a "dedup=on" and forget about it.

Some more discussion as it relates to some backup dedupe appliances (the
principles are the same):

    http://tinyurl.com/36369pb
    http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html

[1] It may actually be 10^-38 (2^-128) or so because of the birthday
paradox, but we're still talking unlikely. You have a better chance of dying
from lightning or being attacked by a mountain lion:

    http://www.blog.joelx.com/odds-chances-of-dying/877/
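
If you want to redo the arithmetic yourself, here is a minimal Python sketch. It assumes SHA-256 behaves like an ideal 256-bit random function and uses the usual birthday approximation p ~= n^2 / 2^(bits+1). The function names and the block count (2^35 unique blocks, roughly 4 PiB at 128 KiB per block) are made up for illustration; the disk error figures are the ones quoted above.

```python
import math

HASH_BITS = 256  # SHA-256 digest size in bits


def log10_pair_collision(bits=HASH_BITS):
    """log10 of the chance that two specific, different blocks share a hash."""
    return -bits * math.log10(2)


def log10_birthday_collision(n_blocks, bits=HASH_BITS):
    """log10 of the approximate chance of ANY collision among n_blocks
    distinct blocks, using the birthday bound p ~= n^2 / 2^(bits + 1).
    The approximation only holds while the result is far below 1."""
    return 2 * math.log10(n_blocks) - (bits + 1) * math.log10(2)


def orders_of_magnitude_gap(log10_a, log10_b):
    """How many powers of ten more likely event A is than event B."""
    return log10_a - log10_b


# Hypothetical workload: 2^35 unique blocks in the pool.
n_blocks = 2 ** 35

pair = log10_pair_collision()
pool = log10_birthday_collision(n_blocks)
sata = -14.0  # quoted non-recoverable bit error odds, SATA
sas = -15.0   # quoted non-recoverable bit error odds, SAS

print(f"single pair collision: 10^{pair:.1f}")
print(f"any collision among 2^35 blocks: 10^{pool:.1f}")
print(f"SATA read error vs pair collision: {orders_of_magnitude_gap(sata, pair):.0f} orders of magnitude")
print(f"SAS read error vs pair collision: {orders_of_magnitude_gap(sas, pair):.0f} orders of magnitude")
```

Under those assumptions the single-pair figure comes out near 10^-77 and the whole-pool figure near 10^-56, so even after the birthday correction a disk read error remains far more likely than a collision.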