On Tue, Feb 9, 2016 at 3:13 PM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> On Tuesday 09 February 2016 02:11, Chris Angelico wrote:
>
>> That's fine for comparing one file against one other. He started out
>> by saying he already had a way to compare files for equality. What he
>> wants is a way to capitalize on that to find all the identical files
>> in a group. A naive approach would simply compare every file against
>> every other, for O(N*N) comparisons - but a hash lookup can make that
>> O(N) on the files themselves, plus (I think) an O(N log N) hash
>> comparison job, which has much lower constant factors.
>
> You're mixing up N's there :-) In the first two (compare every file
> against every other file, and hash lookup), N stands for the number of
> files. But in the hash comparison, well, I'm not sure what you mean by
> that, unless you mean the calculation of the hash itself, which will be
> O(M) where M is the size of the file.
>
> Unfortunately, even using a hash, you still have to do O(N**2) work, or
> rather, O(N*M) where N is the number of files and M is their size.
>
My intention was that every N was "number of files being compared", but
it's possible I made a mistake elsewhere. Assuming the hashes become
integers too large for a bucket sort (which they certainly will, unless
you want inordinate numbers of hash collisions), the code would look
something like this:

from collections import defaultdict

hash_to_filename = defaultdict(list)
for fn in files:
    # Step 1: Hash every file.
    file_hash = calculate_hash(fn)
    # Step 2: Locate all pairs of files with identical hashes
    hash_to_filename[file_hash].append(fn)

Afterwards, any list in hash_to_filename holding more than one name is a
set of candidate duplicates.

This loop executes once for each file, so calculate_hash() is called once
for each file. The cost of that is O(N) calls, or O(N*M) once you count
reading the file contents. If you're fairly sure the files are going to
differ in size, you could use the file size *as* the hash - it's cheap to
calculate (no M component whatsoever - just a stat call, and even that
can be optimized away in some cases, e.g. if you're scanning a whole
directory), but it isn't cryptographically secure and it ignores file
content altogether. (There's a sketch of that two-pass idea in the P.S.)

The second step involves one dictionary lookup or insertion for each
file. Call that an O(log N) operation (CPython dicts are hash tables, so
on average it's closer to amortized O(1), but O(log N) is a safe upper
bound), where N is the number of elements in the dictionary. So yes, it's
not quite the same as the number of files (if every file exists exactly
twice, this N will be half the other N), but it's broadly going to
correspond. And an O(log N) operation performed N times has an overall
cost of O(N log N).

I might have something wrong here, but I definitely meant the Ns to all
mean the same thing, like X on a Magic: The Gathering card. :)

ChrisA
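
P.S. To make that two-pass idea concrete, here's a rough, untested
sketch. calculate_hash() here is just one plausible implementation
(SHA-256 of the contents, read in chunks so huge files don't need to fit
in memory); find_duplicates() and the chunk size are inventions for the
example, and os.scandir() needs Python 3.5+ (or the scandir backport).
The first pass groups by size, so the expensive O(M) hashing only happens
for files that share a size with at least one other file:

import hashlib
import os
from collections import defaultdict

def calculate_hash(path, chunk_size=65536):
    # One plausible hash: SHA-256 of the file contents, chunked reads.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(directory):
    # First pass: group by size. os.scandir hands us the size without a
    # separate stat call on most platforms - the "optimized away" case.
    size_to_filenames = defaultdict(list)
    for entry in os.scandir(directory):
        if entry.is_file():
            size_to_filenames[entry.stat().st_size].append(entry.path)
    # Second pass: hash only files whose sizes collide with another's.
    hash_to_filename = defaultdict(list)
    for names in size_to_filenames.values():
        if len(names) > 1:
            for fn in names:
                hash_to_filename[calculate_hash(fn)].append(fn)
    # Lists with two or more names are candidate duplicate sets.
    return [names for names in hash_to_filename.values() if len(names) > 1]

Anything this returns is still only a *candidate* set, of course - which
is where his existing file-comparison routine comes back in, to confirm
byte-for-byte equality.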