On Thu, 1 Feb 2006, it was written:

> Tom Anderson <[EMAIL PROTECTED]> writes:
>
>>> The obvious way is make a list of hashes, and sort the list.
>>
>> Obvious, perhaps, prudent, no. To make the list of hashes, you have to
>> read all of every single file first, which could take a while. If your
>> files are reasonably random at the beginning, ...
>
> The possibility of two different mp3 files having the same id3 tags is
> something you might specifically be checking for.
So read from the end of the file, rather than the beginning.

>> Better yet, note that if two files are identical, they must have the
>> same length, and that finding the length can be done very cheaply, so a
>> quicker yet approach is to make a list of lengths, sort that, and look
>> for duplicates; when you find one, do a byte-by-byte comparison of the
>> files (probably terminating in the first block) to see if they really
>> are the same.
>
> Yes, checking the file lengths first is an obvious heuristic, but if you
> find you have a bunch of files with the same length, what do you do?
> You're back to a list of hashes.

Or prefixes or suffixes.

>> By way of example, of the 2690 music files in my iTunes library, i have
>> twelve pairs of same-sized files [1], and all of these differ within
>> the first 18 bytes (mostly, within the first 9 bytes).
>
> That's a small enough set of matches that you don't need a general
> purpose algorithm.

True - and this is *exactly* the situation that the OP was talking about,
so this algorithm is appropriate. Moreover, i believe it is representative
of most situations where you have a bunch of files to compare. Of course,
cases where files are tougher to tell apart do exist, but i think they're
corner cases.

Could you suggest a common kind of file with degenerate lengths, prefixes
and suffixes? The only one that springs to mind is a set of same-sized
image files in some uncompressed format, recording similar images (frames
in a movie, say), where the differences might be buried deep in the pixel
data.

As it happens, i have just such a dataset on disk: with the images in TIFF
format, i get differences between subsequent frames after 9 bytes, but i
suspect that's a timestamp or something; if i convert everything to a nice
simple BMP (throwing away 8 bits per sample of precision in the process -
probably turning most of the pixels to 0!), then i find differences about
a megabyte in. If i compare from the tail in, i also have to wade through
about a megabyte before finding a difference. Here, hashes would be ideal.
(There's a rough sketch of the whole lengths-then-ends-then-hashes
approach below, after my sig.)

tom

--
The revolution is here. Get against the wall, sunshine. -- Mike Froggatt
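To make that concrete, here's a rough sketch of the sort of thing i have
in mind: group by length first, then by the first and last few kilobytes,
and only hash the files that still collide. It's untested, and the 8 KB
block size, the choice of SHA-1 and the helper names are all arbitrary:

import hashlib
import os
from collections import defaultdict

BLOCK = 8192  # how much of each end of a file to peek at; arbitrary


def ends(path):
    """Return (first block, last block), reading at most 2*BLOCK bytes."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        head = f.read(BLOCK)
        if size > BLOCK:
            # read the remainder of the file's tail, without re-reading
            # anything already covered by the head
            f.seek(max(size - BLOCK, BLOCK))
            tail = f.read(BLOCK)
        else:
            tail = b''
    return head, tail


def full_hash(path):
    """Hash the whole file, a block at a time."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            block = f.read(BLOCK)
            if not block:
                break
            h.update(block)
    return h.hexdigest()


def find_duplicates(paths):
    """Yield lists of paths whose contents appear to be identical."""
    # pass 1: group by length - very cheap, no file contents read at all
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    for group in by_size.values():
        if len(group) < 2:
            continue
        # pass 2: group by first and last blocks - a few KB read per file
        by_ends = defaultdict(list)
        for p in group:
            by_ends[ends(p)].append(p)

        for subgroup in by_ends.values():
            if len(subgroup) < 2:
                continue
            # pass 3: only now pay the price of reading whole files
            by_hash = defaultdict(list)
            for p in subgroup:
                by_hash[full_hash(p)].append(p)
            for dupes in by_hash.values():
                if len(dupes) > 1:
                    yield dupes

If you'd rather not trust a hash at all, the last pass could instead do a
byte-by-byte comparison of the surviving candidates with
filecmp.cmp(a, b, shallow=False), which is closer to what was suggested
above. Either way, with twelve same-sized pairs out of 2690 files, the
expensive pass almost never runs - which is the point.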