Dan Stromberg <[EMAIL PROTECTED]> writes: > I've been using the following compare function, which in short checks, in > order: > > 1) device number > 2) inode number > 3) file length > 4) the beginning of the file > 5) an md5 hash of the entire file > 6) the entire file > > (If #1 and #2 are identical, then the file must be a hardlink to the other > file. Also, files that do not have the same length can never be > identical. And of course these items are generated on demand, making it > frequently possible to avoid doing full-file-length compare on a lot of > files) > > However, my program is still doing more #6 comparisons than seems
Just don't do any #6 comparisons. If the hashes are identical, the files are identical--that's the point of a cryptographic hash. OK, to be pedantic: 1) there's a microscopic chance of an accidental collision, but it's much lower than the chance of a hardware failure making your comparison function give the wrong answer. 2) there are known cryptanalytic attacks against md5 that can let someone deliberately construct two distinct files with the same hash, with a fair amount of efort. If you care about this, use sha-1 instead, or better yet, sha-256. (sha-1 has an attack similar to the md5 attack, but the amount of work required is not really practical today, unlike the md5 attack). -- http://mail.python.org/mailman/listinfo/python-list