Patrick Useldinger wrote:
Just to explain why I appear to be a lawyer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I came up with 2 possible reasons:
1) it's easier to program: you don't compare several files in parallel,
but process them one by one. But it's not perfect and you still need to
compare afterwards. In the worst case, you end up with 3 files with
identical hashes, of which 2 are identical and 1 is not. To detect
this, you'd still have to program the algorithm I use, unless you
say "oh well, there's a problem with the hash, go and look yourself."
2) it's probably useful if you compare files over a network and you want
to reduce bandwidth. A hash lets you do that at the cost of local CPU
and disk usage, which may be OK. That was not my case.
3) Probability is on your side. If two files match in length but differ
   in contents, a good hash will have probability (1 / max_hash) of
   producing a matching hash. For exactly two files of the same length,
   calculating the hash is clearly a waste. For even three files of
   the same length, you will most likely find them distinct in three
   comparisons. Using hashes, that takes three file reads and three
   comparisons of hash values. Without hashes, it takes six file reads;
   you must read both files to do a direct comparison, so three
   comparisons means six file reads. Naturally, as the number of files
   grows beyond three, the difference grows drastically.
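The hash-then-verify approach from point 1 can be sketched like this. This is a minimal illustration, not Patrick's actual program; names like `duplicate_groups` and `md5_of` are made up for the example. The byte-for-byte step at the end is exactly the extra work point 1 warns about: a matching hash alone does not prove two files are identical.

```python
import filecmp
import hashlib
from collections import defaultdict


def md5_of(path, chunk_size=65536):
    # Hash the file in chunks to avoid loading it all into memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_groups(paths):
    # Step 1: bucket files by hash -- easy to program, one file at a time.
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[md5_of(path)].append(path)

    # Step 2: within each bucket, confirm byte-for-byte, since a hash
    # collision could put non-identical files in the same bucket.
    # (Simplification: this only compares against the first file in the
    # bucket; a full solution would partition the whole bucket.)
    confirmed = []
    for group in by_hash.values():
        if len(group) > 1:
            first = group[0]
            same = [first] + [p for p in group[1:]
                              if filecmp.cmp(first, p, shallow=False)]
            if len(same) > 1:
                confirmed.append(same)
    return confirmed
```

The sketch shows why the hash alone doesn't finish the job: step 2 is the same comparison work you would otherwise program directly.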
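The read counts in point 3 can be checked with a back-of-the-envelope worst-case tally, assuming every pairwise comparison reads both files in full and all files turn out distinct (the function names here are made up for the example):

```python
from math import comb


def reads_with_hashes(n):
    # Hashing: read each of the n files once; hash comparisons are
    # then done in memory at no extra I/O cost.
    return n


def reads_without_hashes(n):
    # Direct comparison, worst case (all files distinct): every one of
    # the C(n, 2) pairwise comparisons reads both files in full.
    return 2 * comb(n, 2)


# e.g. for 3 files: 3 reads with hashes vs 6 without, as in the text;
# for 2 files the counts tie, so hashing buys nothing there.
for n in (2, 3, 10):
    print(n, reads_with_hashes(n), reads_without_hashes(n))
```

Under these assumptions the counts match the text: hashing ties at two files and pulls ahead rapidly after that, since the direct-comparison cost grows quadratically while hashing stays linear.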
-Scott David Daniels
[EMAIL PROTECTED]
--
http://mail.python.org/mailman/listinfo/python-list