On Apr 17, 5:30 am, Tim Wintle <tim.win...@teamrubber.com> wrote:
> On Thu, 2009-04-16 at 21:44 -0700, Adam Olsen wrote:
> > The Wayback Machine has 150 billion pages, so 2**37.  Google's index
> > is a bit larger at over a trillion pages, so 2**40.  A little closer
> > than I'd like, but that's still 562949950000000 to 1 odds of having
> > *any* collisions between *any* of the files.  Step up to SHA-256 and
> > it becomes 191561940000000000000000000000000000000000000000000000 to
> > 1.  Sadly, I can't even give you the odds for SHA-512, Qalculate
> > considers that too close to infinite to display. :)
>
> That might be true as long as your data is completely uniformly
> distributed. For the example you give there's:
>
> a) a high chance that there's "<html>" near the top
>
> b) a non-uniform distribution of individual words within the text.
>
> c) a non-uniform distribution of all n-grams within the text (as there
> is in natural language)
>
> So it's very far from uniformly distributed. Just about the only
> situation where I could imagine that holding would be where you are
> hashing uniformly random data for the sake of testing the hash.
>
> I believe the point being made is that comparing hash values is a
> probabilistic algorithm anyway, which is fine if you're ok with that,
> but for mission-critical software it's crazy.
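For reference, those odds follow from the standard birthday bound: with
n inputs and a b-bit hash, the chance of *any* collision is roughly
n**2 / 2**(b+1). The first figure above matches a 128-bit hash such as
MD5 at 2**40 inputs, and the second matches SHA-256. A quick sketch to
check it (the helper name is mine, and the bound is an approximation):

def odds_against_collision(n_items, hash_bits):
    # Birthday bound: P(any collision among n_items random values of
    # hash_bits bits) is roughly n_items**2 / 2**(hash_bits + 1), so
    # the odds against are about 2**(hash_bits + 1) / n_items**2 to 1.
    return 2 ** (hash_bits + 1) // n_items ** 2

n = 2 ** 40  # roughly a trillion pages
print(odds_against_collision(n, 128))  # 562949953421312, as quoted
print(odds_against_collision(n, 256))  # ~1.9156e53, as quoted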
Actually, *cryptographic* hashes handle that just fine. Even for two
files that differ by only a single bit, the outputs are completely
different; this is known as the avalanche effect. Without it they'd be
vulnerable to attacks.

Which isn't to say you couldn't *construct* a pattern a given hash is
vulnerable to; figuring that out is pretty much the whole point of
attacking a cryptographic hash. MD5 has significant vulnerabilities by
now, and others will in the future. That's just a risk you need to
manage.
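To see the avalanche effect concretely, here's a minimal sketch using
hashlib (SHA-256 here, but any cryptographic hash behaves the same):

import hashlib

msg = b"The quick brown fox jumps over the lazy dog"
flipped = bytes([msg[0] ^ 0x01]) + msg[1:]  # flip one bit of the input

d1 = hashlib.sha256(msg).hexdigest()
d2 = hashlib.sha256(flipped).hexdigest()
print(d1)
print(d2)

# On average about half the output bits change:
diff = int(d1, 16) ^ int(d2, 16)
print(bin(diff).count("1"), "of 256 bits differ")

The two digests share no visible structure, which is why the
non-uniformity of the *inputs* (HTML boilerplate, natural-language
n-grams) doesn't carry through to the *outputs*.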