Thanks for the advice. There are definitely some performance issues I hadn't thought of before. I guess it's time to go lengthen, not shorten, the script.

Lowell

John Machin wrote:
Lowell Kirsh wrote:

I have a script which I use to find all duplicates of files within a
given directory and all its subdirectories. It seems like it's longer
than it needs to be but I can't figure out how to shorten it. Perhaps
there are some python features or libraries I'm not taking advantage
of.

The way it works is that it puts references to all the files in a
dictionary with file size being the key. The dictionary can hold
multiple values per key. Then it looks at each key and all the
associated files (which are the same size). Then it uses filecmp to
see if they are actually byte-for-byte copies.

It's not 100% complete but it's pretty close.

Lowell
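[For reference, the approach described above could be sketched roughly
like this -- a minimal sketch, not Lowell's actual script; the function
and variable names are invented for illustration:]

```python
import filecmp
import os
from collections import defaultdict

def find_duplicates(root):
    """Group files under root by size, then confirm with filecmp."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        # Compare each same-size pair byte for byte.
        # shallow=False forces a content comparison.
        for i, a in enumerate(paths):
            for b in paths[i + 1:]:
                if filecmp.cmp(a, b, shallow=False):
                    duplicates.append((a, b))
    return duplicates
```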


To answer the question in the message subject: 1,$d

And that's not just the completely po-faced literal answer that the
question was begging for: why write something when it's already been
done? Try searching this newsgroup; there was a discussion on this very
topic only a week ago, during which the effbot provided the URL of an
existing python file duplicate detector. There seems to be a discussion
every so often ...

However, if you persist in DIY, read the discussions in this newsgroup
and search the net (people have implemented this functionality in other
languages); think about some general principles -- like whether you
should use a hash (e.g. SHA-n where n is a suitably large number). If
there are N files all of the same size, you have two options: (a) do
O(N**2) file comparisons, or (b) do N hash calcs and then compare
hashes -- O(N**2) cheap comparisons if done pairwise, or O(N) if you
bucket the hashes in a dictionary. Then, depending on your
need/whim/costs-of-false-negatives/positives, you can stop there or do
full file comparisons on the ones whose hashes match. You do however
need to consider that calculating a hash involves reading the whole
file, whereas comparing two files can stop as soon as a difference is
detected. Also, do you understand, and are you happy with, the
(default) "shallow" option of filecmp.cmp()?
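[A hash-based variant of option (b) might look like the sketch below.
The names are invented for illustration, and SHA-256 via hashlib is an
assumption -- any suitably wide hash would do. Bucketing digests in a
dictionary avoids the pairwise comparisons, and hashing in chunks keeps
memory bounded on large files:]

```python
import hashlib
from collections import defaultdict

def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def group_by_hash(paths):
    """Bucket candidate paths (already known to be the same size) by digest."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[sha256_of(path)].append(path)
    # Buckets holding more than one path are near-certain duplicates;
    # the cautious can still run filecmp.cmp() within each bucket.
    return [group for group in by_digest.values() if len(group) > 1]
```

[On the "shallow" question: with the default shallow=True, filecmp.cmp()
treats two files as equal whenever their os.stat() signatures -- file
type, size, and modification time -- match, without reading the
contents; pass shallow=False to force a byte-for-byte comparison.]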

--
http://mail.python.org/mailman/listinfo/python-list
