Lowell
John Machin wrote:
Lowell Kirsh wrote:
I have a script which I use to find all duplicates of files within a given directory and all its subdirectories. It seems like it's longer than it needs to be, but I can't figure out how to shorten it. Perhaps there are some Python features or libraries I'm not taking advantage of.
The way it works is that it puts references to all the files in a dictionary with file size being the key. The dictionary can hold multiple values per key. Then it looks at each key and all the associated files (which are the same size), and uses filecmp to see if they are actually byte-for-byte copies.
It's not 100% complete, but it's pretty close.
Lowell
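[For concreteness, here is a minimal sketch of the approach Lowell describes: a dict mapping file size to a list of paths, then filecmp on the same-size candidates. The function and variable names are illustrative, not taken from his script, and shallow=False is passed explicitly so the comparison actually reads file contents.]

import os
import filecmp
from collections import defaultdict

def find_duplicates(root):
    # Map each file size to the list of paths having that size
    # (the "dictionary that can hold multiple values per key").
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass  # skip files that vanish or can't be stat'ed

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have duplicates
        # Compare each same-size pair; shallow=False forces a content read.
        for i, first in enumerate(paths):
            for second in paths[i + 1:]:
                if filecmp.cmp(first, second, shallow=False):
                    duplicates.append((first, second))
    return duplicates

if __name__ == "__main__":
    for a, b in find_duplicates("."):
        print(a, "==", b)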
To answer the question in the message subject: 1,$d
And that's not just the completely po-faced literal answer that the question was begging for: why write something when it's already been done? Try searching this newsgroup; there was a discussion on this very topic only a week ago, during which the effbot provided the URL of an existing Python file duplicate detector. There seems to be a discussion every so often ...
However, if you persist in DIY: read the discussions in this newsgroup, search the net (people have implemented this functionality in other languages), and think about some general principles -- like whether you should use a hash (e.g. SHA-n where n is a suitably large number). If there are N files all of the same size, you have two options: (a) do O(N**2) file comparisons, or (b) do N hash calculations followed by O(N**2) hash comparisons; then, depending on your needs/whims/costs of false negatives and positives, you can stop there or do the file comparisons only on the ones whose hashes match.
You do, however, need to consider that calculating the hash involves reading the whole file, whereas comparing two files can stop as soon as a difference is detected. Also, do you understand, and are you happy with, the (default) "shallow" option of filecmp.cmp()?
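[A rough sketch of option (b), assuming hashlib's SHA-256 and that the files have already been grouped by size; the helper names are illustrative, not from any posted script.]

import hashlib
from collections import defaultdict

def sha256_of(path, chunk_size=1 << 20):
    # Hash the file incrementally so large files need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def duplicate_groups(same_size_paths):
    # Bucket files (already known to share a size) by content hash.
    by_hash = defaultdict(list)
    for path in same_size_paths:
        by_hash[sha256_of(path)].append(path)
    # Buckets with more than one member are duplicate candidates; a final
    # filecmp.cmp(a, b, shallow=False) pass can rule out hash collisions
    # if false positives matter to you.
    return [paths for paths in by_hash.values() if len(paths) > 1]

[On the "shallow" question: with the default shallow=True, filecmp.cmp() only compares the os.stat() signatures (file type, size, modification time), so two different files that happen to share size and mtime can be reported as equal; shallow=False forces a byte-for-byte read.]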