If you want to avoid an O(n^2) algorithm, you may need to find a signature for each file. Then you use such signatures to compute hashes, and unique them with a dict (dict values may be the file names). Later you can weed out the few wrong collisions produced by the possibly approximated signature. If you can determine of a given a file where its heading garbage stops, then you can compute the signature just computing the python hash of some of the following bytes (30 or 200 byte may suffice).
Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list