At the end of the day, this issue essentially comes down to mathematical 
improbability versus certainty.

To be quite honest, I too am skeptical about using de-dupe based on SHA-256 
alone.  In prior posts the potential adopter of the technology was asked to 
provide the mathematical reason NOT to use SHA-256 only.  However, if Oracle 
believes that SHA-256 alone is adequate, would it be possible for somebody to 
provide:

(A) The theoretical documents and associated mathematics for, say, one 
simple use case?
(A1) Total data size is 1PB (let's say the zpool is 2PB, so pool capacity is 
not a concern).
(A2) Daily, 10TB of data is updated, 1TB of data is deleted, and 1TB of data is 
'new'.
(A3) Out of the dataset, 25% of the data is capable of being de-duplicated.
(A4) The 25% rule from A3 also applies to everything in the daily churn from 
A2 (a rough sketch of the kind of calculation I have in mind follows just 
below).
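
For what it's worth, here is the shape of the calculation I would expect such 
a document to contain.  This is only a back-of-the-envelope sketch of the 
standard birthday bound; the 128K block size (the ZFS default recordsize), the 
one year of churn, and the script itself are my own assumptions, not anything 
from a real pool:

#!/usr/bin/env python
# Back-of-the-envelope birthday bound for the scenario in A1-A4.
# Assumptions (mine): uniform 128K blocks, SHA-256 modeled as a 256-bit
# uniform random function, and every block touched by the daily churn
# (10TB updated + 1TB new) ends up as a new entry in the dedup table.

TB = 2 ** 40
BLOCK = 128 * 1024                   # assumed block size (ZFS default recordsize)

initial_blocks = 1024 * TB // BLOCK  # 1PB of existing data
daily_blocks = 11 * TB // BLOCK      # 10TB updated + 1TB new per day
after_one_year = initial_blocks + 365 * daily_blocks

def collision_bound(n):
    """Upper bound on P(any SHA-256 collision among n distinct blocks)."""
    return n * (n - 1) / 2.0 / 2.0 ** 256

print("day 1        : %d blocks, P(collision) <= %.1e"
      % (initial_blocks, collision_bound(initial_blocks)))
print("after 1 year : %d blocks, P(collision) <= %.1e"
      % (after_one_year, collision_bound(after_one_year)))

The 25% dedup ratio from A3/A4 would only shrink the number of unique blocks 
that actually land in the dedup table, so if anything the bound above is an 
over-estimate.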


I think the above would be a pretty 'soft' case for justifying that SHA-256 
alone works.  I would presume some simple scenario like this was already run 
through the math by somebody inside Oracle/Sun long ago, when ZFS was first 
proposed for internal funding.


Then there is the other side of things: the 'black swan' event.  At some 
point, given the probabilities from a scenario like the example above, one 
simply has to make the internal business case at their own company for going 
SHA-256 only versus Fletcher+Verification.  Add Murphy's Law to the black swan 
event and, of course, the only data lost will be the 0.01% of your data that 
is the most critical.
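
To put the trade-off in the crudest possible terms, the decision reduces to 
expected loss versus the cost of verification.  The numbers in the toy sketch 
below are placeholders I made up purely to show the shape of that comparison, 
not real figures:

# Toy expected-loss framing for SHA-256-only vs. Fletcher+Verification.
# Every number here is a placeholder, not a measurement.

p_collision_per_year = 7.7e-57   # e.g. a one-year birthday bound like the one above
cost_if_wrong_block = 1.0e7      # dollars, if Murphy picks the critical 0.01%
verify_io_cost_per_year = 1.0e4  # dollars of extra read bandwidth (made up)

expected_loss_no_verify = p_collision_per_year * cost_if_wrong_block
print("expected annual loss, SHA-256 only  : $%.1e" % expected_loss_no_verify)
print("assumed annual cost of verification : $%.1e" % verify_io_cost_per_year)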



I am not trying to be aggressive or combative against anyone's opinions or 
understanding here.  I would just like to see some hard information on this; 
it must exist somewhere already.

Thanks,
 
- Mike




On Jan 6, 2011, at 10:05 PM, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Peter Taps
>> 
>> Perhaps (Sha256+NoVerification) would work 99.999999% of the time. But
> 
> Append 50 more 9's on there. 
> 99.99999999999999999999999999999999999999999999999999999999%
> 
> See below.
> 
> 
>> I have been told that the checksum value returned by Sha256 is almost
>> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
>> bigger problem such as memory corruption, etc. Essentially, adding
>> verification to sha256 is an overkill.
> 
> Someone please correct me if I'm wrong.  I assume ZFS dedup matches both the
> blocksize and the checksum right?  A simple checksum collision (which is
> astronomically unlikely) is still not sufficient to produce corrupted data.
> It's even more unlikely than that.
> 
> Using the above assumption, here's how you calculate the probability of
> corruption if you're not using verification:
> 
> Suppose every single block in your whole pool is precisely the same size
> (which is unrealistic in the real world, but I'm trying to calculate worst
> case.)  Suppose the block is 4K, which is again, unrealistically worst case.
> Suppose your dataset is purely random or sequential ... with no duplicated
> data ... which is unrealistic because if your data is like that, then why in
> the world are you enabling dedupe?  But again, assuming worst case
> scenario...  At this point we'll throw in some evil clowns, spit on a voodoo
> priestess, and curse the heavens for some extra bad luck.
> 
> If you have astronomically infinite quantities of data, then your
> probability of corruption approaches 100%.  With infinite data, eventually
> you're guaranteed to have a collision.  So the probability of corruption is
> directly related to the total amount of data you have, and the new question
> is:  For anything Earthly, how near are you to 0% probability of collision
> in reality?
> 
> Suppose you have 128TB of data.  That is ...  you have 2^35 unique 4k blocks
> of uniformly sized data.  Then the probability you have any collision in
> your whole dataset is (sum(1 thru 2^35))*2^-256 
> Note: sum of integers from 1 to N is  (N*(N+1))/2
> Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
> Note: (N*(N+1))/2 in this case = 2^69 + 2^34
> So the probability of data corruption in this case, is 2^-187 + 2^-222 ~=
> 5.1E-57 + 1.5E-67
> 
> ~= 5.1E-57
> 
> In other words, even in the absolute worst case, cursing the gods, running
> without verification, using data that's specifically formulated to try and
> cause errors, on a dataset that I bet is larger than what you're doing, ...
> 
> Before we go any further ... The total number of bits stored on all the
> storage in the whole planet is a lot smaller than the total number of
> molecules in the planet.
> 
> There are an estimated 8.87 * 10^49 molecules in planet Earth.
> 
> The probability of a collision in your worst-case unrealistic dataset as
> described, is even 100 million times less likely than randomly finding a
> single specific molecule in the whole planet Earth by pure luck.
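
For anyone who wants to check it, Ned's arithmetic above is easy to reproduce; 
the snippet below simply redoes his stated worst case of 128TB stored as 2^35 
uniformly sized 4K blocks:

from math import log10

n = 2 ** 35               # 128TB of uniform 4K blocks, per the quoted worst case
pairs = n * (n + 1) // 2  # sum of the integers 1..n, as in the quoted derivation
prob = pairs / 2.0 ** 256 # collision probability bound
print("P(collision) ~= %.1e (about 2^%.0f)" % (prob, log10(prob) / log10(2)))
# Prints roughly 5.1e-57, i.e. about 2^-187, matching the figures quoted above.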
> 
> 
> 
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
