On Dec 1, 2013, at 1:31 PM, Graham Cox <graham....@bigpond.com> wrote:

> That doesn’t mean it’s impossible, but it surely indicates that it’s 
> vanishingly improbable?

Actually, no. The next file added could collide, and your test provides 0 
information as to how likely that is, or even if the fact that you got no 
collisions was an outcome that had a vanishingly small probability. Now, the 
probability of collisions is in fact small, but your test doesn't prove it--it 
just provides evidence that the probability analysis is not fantastically wrong.

What would provide stronger evidence would be looking at both sets of 63-bit 
hashes, all 3 sets of 62-bit hashes, and so on until you find the size at which 
you get collisions. But this would be an exercise purely for curiosity.

Now, sometime ago in this discussion, Kyle mentioned the "birthday problem", 
and it is a valid point, in that the probability of a collision is proportional 
to an exponential of the square of the number of documents. So, while for 1,000 
documents the chance of a collision is down around 1 in 10^14, for 1,000,000 
documents it's up to 1 in 10^7. But there's an easy solution, use a 128-bit 
hash. (The probability of collision is inversely proportional to an exponential 
of the number of hashes, in other words to an exponential of 2^ number of 
bits.) So with 128 bits you're at 1 in 10^33 for 1,000 documents, and 1 in 
10^27 for 1,000,000 documents. On the other hand, with a 32-bit hash, at 1,000 
documents your odds are not so comforting (1 in 10,000 chance of a collision) 
and for 1,000,000 documents, ugh, the odds of *not* having a collision are only 
about 1 in 10^51. All of that is assuming a hash with perfectly random 
distribution, and rounded to 1 significant digit as is appropriate for such 
rough approximations.

Of course, you can always treat the hash as a quick index to find possible 
matches, then compare contents. Even with 32-bit hashes, you would not be 
comparing files very often, with 64-bit probably never, with 128-bit never, to 
a degree of certainty greater than anything else in your life or your 
customers'...

-- 
Scott Ribe
scott_r...@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice





_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to arch...@mail-archive.com

Reply via email to