Patrick Useldinger wrote:
Just to explain why I appear to be a lawyer: everybody I spoke to about this program told me to use hashes, but nobody has been able to explain why. I came up with two possible reasons myself:

1) it's easier to program: you don't compare several files in parallel, but process them one by one. But it's not perfect, and you still need to compare afterwards. In the worst case, you end up with three files with identical hashes, of which two are identical and one is not. To detect this, you'd still have to implement the algorithm I use, unless you say "oh well, there's a problem with the hash, go and look yourself."

2) it's probably useful if you compare files over a network and you want to reduce bandwidth. A hash lets you do that at the cost of local CPU and disk usage, which may be OK. That was not my case.

3) Probability is on your side.  If two files match in length, but differ
   in contents, a good hash will have probability (1 / max_hash) of
   having a matching hash.  For exactly two files of the same length,
   calculating the hash is clearly a waste.  For even three files of
   the same length, you are most likely to find them distinct in three
   comparisons.  Using hashes: three file reads and three comparisons
   of hash values.  Without hashes: six file reads; you must read both
   files to do a file comparison, so three comparisons means six file
   reads.  Naturally, as the count grows beyond three, the difference
   grows drastically.
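The read-count arithmetic in point 3 can be spelled out in a few lines. This is a rough count following the argument above (ignoring partial reads and early-exit optimizations, which a real pairwise comparison would use):

```python
def reads_without_hash(n):
    """Reads needed to decide n same-length files are pairwise distinct
    by direct comparison: every pair is compared, and each comparison
    reads both files in full."""
    pairs = n * (n - 1) // 2
    return 2 * pairs

def reads_with_hash(n):
    """With hashes, each file is read once to compute its digest;
    the digests are then compared in memory."""
    return n

print(reads_without_hash(3), reads_with_hash(3))  # → 6 3
```

For three files this gives six reads versus three, matching the count above, and the gap widens quadratically as n grows.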
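For what it's worth, the two approaches are not mutually exclusive: you can hash only files whose sizes collide, then confirm hash matches byte-for-byte, which also handles the worst case from point 1. A minimal sketch (the function name and structure are my own, not Patrick's program; it reads whole files into memory for brevity):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Return groups of confirmed-identical files among paths."""
    # Step 1: group by size; files of unique size cannot have duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Step 2: hash only the size-colliding candidates.
        by_hash = defaultdict(list)
        for p in same_size:
            with open(p, 'rb') as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(p)
        for group in by_hash.values():
            if len(group) < 2:
                continue
            # Step 3: even with equal hashes, confirm byte-for-byte.
            with open(group[0], 'rb') as f:
                reference = f.read()
            confirmed = [group[0]]
            for p in group[1:]:
                with open(p, 'rb') as f:
                    if f.read() == reference:
                        confirmed.append(p)
            if len(confirmed) > 1:
                duplicates.append(confirmed)
    return duplicates
```

The byte-for-byte step reads each candidate a second time, so this trades extra I/O for certainty; whether that trade is worth it is exactly the question being debated above.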

-Scott David Daniels
[EMAIL PROTECTED]
--
http://mail.python.org/mailman/listinfo/python-list