On Mon, Feb 8, 2016 at 8:46 AM, Paulo da Silva
<p_s_d_a_s_i_l_v_a...@netcabo.pt> wrote:
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
>
> Thanks for any suggestions.

Hash them in some way. This has two parts:

1) You need to figure out some hashing algorithm such that any two
equal files have the same hash, and ideally, that unequal files will
generally have unequal hashes.

2) Hash each file once, and then do your comparisons on the hashes,
finally testing for actual equality only on those with the same hash.
(The last step takes care of hash collisions - where different files
happen to have the same hash - so ideally, you shouldn't have to do
this often.)

If your definition of "equal" among MyFiles is simply based on the
file content, it's easy - just hash the content. But if there are ways
for files to be considered equal without being bit-for-bit identical,
you'll have to reflect that in the hash. For instance, if you consider
files equal if they differ only in whitespace, then you'd need to
convert all whitespace to a single space before hashing; or if you
don't care about the order of lines in the file, you could either hash
the lines separately and sum/xor the hashes, or sort the lines before
hashing. But the rest of the job is pretty straightforward.

ChrisA
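
A rough sketch of the idea, in case it helps. Everything named here is made up for illustration: group_equal, content_hash and the two normalizing hashes are hypothetical helpers, SHA-256 is just one reasonable choice of hash, and the "files" are assumed to be available as bytes. The key function stands in for whatever hash suits your definition of equality, and equal stands in for MyFile's own comparison.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    # Bit-for-bit equality: hash the raw content.
    return hashlib.sha256(data).hexdigest()


def whitespace_insensitive_hash(data: bytes) -> str:
    # Collapse each run of whitespace to one space before hashing
    # (leading and trailing whitespace is dropped as a side effect).
    return hashlib.sha256(b" ".join(data.split())).hexdigest()


def line_order_insensitive_hash(data: bytes) -> str:
    # Sort the lines so that their order does not affect the hash.
    return hashlib.sha256(b"\n".join(sorted(data.splitlines()))).hexdigest()


def group_equal(items, key, equal):
    """Return lists of mutually equal items, skipping items with no match.

    key(item) must give the same value for any two equal items.
    equal(a, b) is the real comparison, used only inside one bucket.
    """
    # Step 1: hash each item once and bucket by hash.
    buckets = defaultdict(list)
    for item in items:
        buckets[key(item)].append(item)

    groups = []
    for bucket in buckets.values():
        if len(bucket) < 2:
            continue  # a unique hash means a unique item
        # Step 2: confirm with the real comparison, so that a hash
        # collision splits the bucket instead of merging unequal items.
        classes = []
        for item in bucket:
            for cls in classes:
                if equal(cls[0], item):
                    cls.append(item)
                    break
            else:
                classes.append([item])
        groups.extend(cls for cls in classes if len(cls) > 1)
    return groups
```

With in-memory data it could be called as group_equal(names, key=lambda n: content_hash(data[n]), equal=lambda a, b: data[a] == data[b]), where data is a hypothetical dict mapping names to bytes. For real files you would probably read and hash in chunks rather than load everything into memory.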