On Thu, 06 Jan 2011 22:42:15 PST Michael DeMan <sola...@deman.com> wrote:
> To be quite honest, I too am skeptical about using de-dupe just based on
> SHA256. In prior posts it was asked that the potential adopter of the
> technology provide the mathematical reason to NOT use SHA-256 only. However,
> if Oracle believes that it is adequate to do that, would it be possible for
> somebody to provide:
>
> (A) The theoretical documents and associated mathematics specific to say one
> simple use case?

See http://en.wikipedia.org/wiki/Birthday_problem -- in particular see
section 5.1 and the probability table of section 3.4.

> On Jan 6, 2011, at 10:05 PM, Edward Ned Harvey wrote:
>
> >> I have been told that the checksum value returned by Sha256 is almost
> >> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
> >> bigger problem such as memory corruption, etc. Essentially, adding
> >> verification to sha256 is an overkill.

Agreed.

> > Someone please correct me if I'm wrong.

OK :-)

> > Suppose you have 128TB of data. That is ... you have 2^35 unique 4k blocks
> > of uniformly sized data. Then the probability you have any collision in
> > your whole dataset is (sum(1 thru 2^35))*2^-256
> > Note: sum of integers from 1 to N is (N*(N+1))/2
> > Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
> > Note: (N*(N+1))/2 in this case = 2^69 + 2^34
> > So the probability of data corruption in this case, is 2^-187 + 2^-222 ~=
> > 5.1E-57 + 1.5E-67
> >
> > ~= 5.1E-57

I believe this is wrong. See the wikipedia article referenced above.

p(n,d) = 1 - d!/(d^n*(d-n)!)

In your example n = 2^35, d = 2^256. If you extrapolate the 256 bits
row of the probability table of section 3.1, it is somewhere between
10^-48 and 10^-51.

This may be easier to grasp: to get a 50% probability of a collision
with sha256, you need 4*10^38 blocks. For a probability similar to
disk error rates (10^-15), you need 1.5*10^31 blocks.
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
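
For anyone who wants to check these figures, here is a minimal Python sketch.
It does not evaluate the exact factorial formula p(n,d) directly; it assumes
the usual birthday-problem approximation p ~= 1 - exp(-n*(n-1)/(2*d)), with
d = 2^256 and uniformly distributed hashes. The function names are made up
for illustration and do not come from ZFS or any library.

```python
import math

HASH_BITS = 256  # SHA-256 digest size in bits


def collision_probability(n_blocks, bits=HASH_BITS):
    """Approximate chance of at least one collision among n_blocks hashes.

    Assumes the approximation p ~= 1 - exp(-n*(n-1)/(2*d)), d = 2**bits.
    """
    pairs = n_blocks * (n_blocks - 1) // 2  # number of block pairs, exact int
    x = pairs / (1 << bits)                 # expected number of colliding pairs
    return -math.expm1(-x)                  # 1 - exp(-x), stays accurate for tiny x


def blocks_for_probability(p, bits=HASH_BITS):
    """Approximate block count needed to reach collision probability p.

    Inverts the same approximation: n ~= sqrt(2 * d * ln(1 / (1 - p))).
    """
    return math.sqrt(-2.0 * math.log1p(-p) * 2.0 ** bits)


if __name__ == "__main__":
    print(f"p for 2^35 blocks : {collision_probability(2 ** 35):.2e}")
    print(f"blocks for p=0.5  : {blocks_for_probability(0.5):.2e}")
    print(f"blocks for p=1e-15: {blocks_for_probability(1e-15):.2e}")
```

Under that approximation the two block counts come out at roughly 4*10^38
and 1.5*10^31, matching the figures above. For n = 2^35 it gives roughly
5.1*10^-57, which is the quoted 2^-187 estimate rather than the range read
off the table, so the table extrapolation may be worth rechecking.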