Hi, I'm a bit green in this area and wonder whether there are existing Python tools for this, or whether I will have to scratch my head real hard for an appropriate algorithm. I'd hate to build an inferior version of something someone has painstakingly built before me.
I have some files that may share the same origin, but some may have had cruft added to the front and some to the back. They may therefore differ slightly in length, but if two files have the same origin they will share a matching run of bytes in the middle, possibly at a different offset in each file. I want to find which files share such a common run with which other files. The cruft should be a small percentage of the overall file length.

Given that I need to match every file against every other file (of similar length), is there a better bet than using re.search? The initial application involves thousands of files, and I could use a solution that scales to hundreds of thousands.

TIA for bearing with my ignorance of the clear solution I'm surely blind to...

EP
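
P.S. To make the question concrete, here is a rough sketch of what I have in mind, using only the standard library. It is not a tested solution. The function names, the `files` mapping of name to bytes, and the 64-byte probe length are placeholders of mine. The idea is that if the cruft is small, a block taken from the middle of one file should appear somewhere in any file of the same origin, so a plain substring test can stand in for re.search. difflib can then report where the shared run sits in each file.

```python
import difflib
from itertools import combinations


def shares_middle_block(a, b, probe_len=64):
    """Cheap test: does a block from the middle of `a` occur anywhere in `b`?

    Assumes the cruft is small, so the middle of `a` is original content.
    """
    if len(a) < probe_len:
        # File shorter than the probe: fall back to testing the whole file.
        return bool(a) and a in b
    start = (len(a) - probe_len) // 2
    return a[start:start + probe_len] in b


def longest_common_block(a, b):
    """Return (offset_in_a, offset_in_b, length) of the longest shared run.

    Slower than the probe test, so only worth running on candidate pairs.
    """
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return m.a, m.b, m.size


def matching_pairs(files, probe_len=64):
    """Yield (name_a, name_b) for every pair that passes the probe test.

    `files` maps a name to that file's contents as bytes.
    Still compares every pair, so the work grows quadratically.
    """
    for (name_a, a), (name_b, b) in combinations(sorted(files.items()), 2):
        if (shares_middle_block(a, b, probe_len)
                or shares_middle_block(b, a, probe_len)):
            yield name_a, name_b
```

This still compares every file with every other file, which is exactly the part I suspect will not scale to hundreds of thousands of files, hence the question.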