[This is version 2; the first one escaped early by mistake.]

On Sun, 2007-06-24 at 16:58 -0700, dave johnson wrote:
> The most common non-proprietary hash calculation for file-level
> deduplication seems to be the combination of SHA1 and MD5 together.
> Collisions have been shown to exist in MD5 and are theorized to exist
> in SHA1 by extrapolation, but the probability of collisions occurring
> in both simultaneously is as "small" as the capacity of ZFS is
> "large" :)
No. Collisions in *any* hash function with output smaller than its input are known to exist through information theory: you can't put kilobytes of information into a 128- or 160-bit bucket. The tricky part lies in finding collisions faster than a brute-force search would find them.

Last I checked, the cryptographers specializing in hash functions were pessimistic. The breakthroughs in collision-finding by Wang & crew a couple of years ago revealed how little the experts actually knew about building collision-resistant hash functions. The advice to those of us who have come to rely on that property was to migrate now to sha256/sha512 (notably, ZFS uses sha256, not sha1), and then migrate again once the cryptographers felt they had a better grip on the problem; the fear was that the newly discovered attacks would generalize to sha256.

But there's another way: design the system so that correct behavior doesn't rely on collisions being impossible to find. I wouldn't de-duplicate without verifying that two blocks or files are actually bitwise identical; if you do this, the collision-resistance of the hash function becomes far less important to correctness.

    - Bill

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
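[Editor's note: the verify-before-dedup idea above can be sketched in a few lines. This is a toy illustration, not ZFS code; the `DedupStore` class and its layout are assumptions made for the example. The hash is used only as a lookup hint, and a bytewise comparison decides whether two blocks are truly duplicates.]

```python
import hashlib

class DedupStore:
    """Toy block store illustrating verify-before-dedup (hypothetical,
    not ZFS). The sha256 digest narrows the search; only a bytewise
    comparison confirms a duplicate, so a hash collision cannot cause
    two different blocks to be silently merged."""

    def __init__(self):
        # digest -> list of distinct blocks sharing that digest
        # (the list has length > 1 only on a real collision)
        self.blocks = {}

    def store(self, data: bytes) -> tuple[bytes, int]:
        """Store a block, deduplicating against bitwise-identical copies.
        Returns (digest, index) identifying the stored block."""
        digest = hashlib.sha256(data).digest()
        candidates = self.blocks.setdefault(digest, [])
        for i, existing in enumerate(candidates):
            if existing == data:        # verify: never trust the hash alone
                return digest, i        # true duplicate; reference it
        candidates.append(data)         # new block (or a hash collision)
        return digest, len(candidates) - 1
```

With verification in place, a collision merely costs one extra comparison and a second slot under the same digest; without it, a collision would corrupt data by aliasing two different blocks.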