Paul Rubin wrote:
> "EP" <[EMAIL PROTECTED]> writes:
> > Given that I am looking for matches of all files against all other
> > files (of similar length) is there a better bet than using re.search?
> > The initial application concerns files in the 1,000's, and I could use
> > a good solution for a number of files in the 100,000's.
>
> If these are text files, typically you'd use the Unix 'diff' utility
> to locate the differences.
If you can, you definitely want to use diff. Otherwise, the difflib
standard library module may be of use to you. Also, since you're talking
about comparing many files to each other, you could pull out a substring
of one file and use the 'in' operator to check whether that substring
appears in another file. Something like this:

# grab a chunk from somewhere in the reference file
f = open(filename)      # or open(filename, 'rb') if binary
f.seek(somewhere_in_the_file)
substr = f.read(some_amount_of_data)
f.close()

# keep only the files that actually contain that chunk
try_diffing_us = []
for fn in list_of_filenames:
    f = open(fn)        # or again open(fn, 'rb')...
    data = f.read()
    f.close()           # close explicitly -- with 100,000's of files,
                        # leaked handles will bite you
    if substr in data:
        try_diffing_us.append(fn)

# then diff just those filenames (see the P.S. below for one way)

That's a naive implementation, but it should illustrate how to cut down
on the number of actual diffs you'll need to perform. Of course, if your
files are large it may not be feasible to do this with all of them. But
they'd have to be really large, or there'd have to be lots and lots of
them... :-)

More information on your actual use case would be helpful in narrowing
down the best options.

Peace,
~Simon
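P.S. Here's a minimal sketch of what that final "diff just those
filenames" step could look like using difflib. It assumes the
try_diffing_us list built above, text files small enough to read into
memory, and an arbitrary 0.9 threshold standing in for "similar enough":

import difflib

# read the reference file once (the same 'filename' used above)
f = open(filename)
base_lines = f.readlines()
f.close()

for fn in try_diffing_us:
    f = open(fn)
    other_lines = f.readlines()
    f.close()
    # ratio() returns a similarity score between 0 and 1; if you only
    # need a rough pre-filter, quick_ratio() is a cheaper upper bound
    ratio = difflib.SequenceMatcher(None, base_lines, other_lines).ratio()
    if ratio > 0.9:
        print fn, ratio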