On Feb 7, 2016, at 4:46 PM, Paulo da Silva <p_s_d_a_s_i_l_v_a...@netcabo.pt> 
wrote:

> Hello!
> 
> This may not be a strict python question, but ...
> 
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
> 
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
> 
> Thanks for any suggestions.

If you're after strict equality (every byte in a pair of files is identical), 
then here are a few heuristics that may help you:

1) Test for file length, without reading in the whole file.  You can use 
os.path.getsize() to do this (I hope that this is a constant-time operation, 
but I haven't tested it).  As Oscar Benjamin suggested, you can create a 
defaultdict(list) which will make it possible to gather lists of files of equal 
size.  This should help you gather your potentially identical files quickly.
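Here's a rough sketch of that first step, assuming a group_by_size() helper name and a 'paths' iterable of filenames (both are just placeholders for whatever you already have):

import os
from collections import defaultdict

def group_by_size(paths):
    """Map file size -> list of paths having that size."""
    by_size = defaultdict(list)
    for path in paths:
        try:
            size = os.path.getsize(path)  # stat-based; doesn't read the contents
        except OSError:
            continue                      # skip files that vanished or can't be read
        by_size[size].append(path)
    # Keep only groups that could possibly contain duplicates.
    return {size: group for size, group in by_size.items() if len(group) > 1}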

2) Once you have your dictionary from above, you can iterate over its values, each 
of which will be a list.  If a list has only one file in it, you know it's 
unique, and you don't have to do any more work on it.  If there are two or more 
files in the list, then you have several different options:
        a) Use Chris Angelico's suggestion and hash each of the files (use the 
standard library's 'hashlib' for this).  Identical files will always have 
identical hashes, but there may be false positives, so you'll need to verify 
that files with identical hashes are indeed identical.  (A rough sketch of this 
follows the list.)
        b) If your files tend to have sections that are very different (e.g., 
the first 32 bytes tend to differ), then you can treat that section of the 
file as its hash and apply the same trick as above.  (The advantage of this is 
that you read in a lot less data than if you had to hash the entire file.)
        c) You may be able to do something clever by reading portions of each 
file.  That is, use zip() combined with read(1024) to read each of the files in 
sections, while keeping hashes of the files.  Or, maybe you'll be able to read 
portions of them and sort the list as you're reading.  In either case, as soon as 
you discover that two files are NOT identical you can stop working on that pair, 
rather than having to read both files in their entirety.  (A sketch of this 
chunked comparison also follows the list.)
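For option (a), something along these lines would work; group_by_hash() is just an illustrative name, and you'd call it on each same-size group from step 1:

import hashlib
from collections import defaultdict

def group_by_hash(paths, chunk_size=1 << 16):
    """Map SHA-256 digest -> list of paths with that digest."""
    by_digest = defaultdict(list)
    for path in paths:
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            # Hash the whole file in chunks; for option (b), you'd hash only
            # a fixed-size prefix instead (e.g. the first 32 bytes).
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        by_digest[h.digest()].append(path)
    # Groups with a single member can't be duplicates.
    return {d: group for d, group in by_digest.items() if len(group) > 1}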
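And for option (c), here's one way the zip()/read(1024) idea might look; files_equal() is a made-up name, and it assumes the two files are already known to be the same size (from step 1), otherwise a file that is a prefix of the other would wrongly compare equal:

def files_equal(path_a, path_b, chunk_size=1024):
    """Compare two same-sized files chunk by chunk, stopping at the first difference."""
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        chunks_a = iter(lambda: fa.read(chunk_size), b'')
        chunks_b = iter(lambda: fb.read(chunk_size), b'')
        for chunk_a, chunk_b in zip(chunks_a, chunks_b):
            if chunk_a != chunk_b:
                return False   # bail out as soon as a difference is found
        return True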

The main purpose of these suggestions is to reduce the amount of reading you're 
doing.  Storage tends to be slow, and any tricks that reduce the number of 
bytes you need to read in will be helpful to you.

Good luck!
Cem Karan
-- 
https://mail.python.org/mailman/listinfo/python-list
