[This is version 2; the first one escaped early by mistake.]

On Sun, 2007-06-24 at 16:58 -0700, dave johnson wrote:
> The most common non-proprietary hash calculation for file-level
> deduplication seems to be the combination of SHA1 and MD5 together.
> Collisions have been shown to exist in MD5 and are theorized to exist
> in SHA1 by extrapolation, but the probability of collisions occurring
> in both simultaneously is as "small" as the capacity of ZFS is
> "large" :)
No. Collisions in *any* hash function with output smaller than its input are known to exist through information theory: you can't put kilobytes of information into a 128- or 160-bit bucket. The tricky part lies in finding collisions faster than a brute-force search would find them.

Last I checked, the cryptographers specializing in hash functions were pessimistic. The breakthroughs in collision-finding by Wang & crew a couple of years ago revealed how little the experts actually knew about building collision-resistant hash functions. The advice to those of us who have come to rely on that property was to migrate now to sha256/sha512 (notably, ZFS uses sha256, not sha1), and then migrate again once the cryptographers felt they had a better grip on the problem; the fear was that the newly discovered attacks would generalize to sha256.

But there's another way: design the system so that correct behavior doesn't rely on collisions being impossible to find. I wouldn't de-duplicate without verifying that two blocks or files are actually bitwise identical; if you do this, the collision-resistance of the hash function becomes far less important to correctness.

    - Bill

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
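[Editor's note: the verify-before-dedup idea above can be sketched in a few lines. This is a toy illustration, not ZFS code; the `DedupStore` class and its layout are assumptions made for the example. The hash is used only as a lookup hint, and a bytewise comparison decides whether two blocks are truly duplicates.]

```python
import hashlib

class DedupStore:
    """Toy block store illustrating verify-before-dedup (hypothetical,
    not ZFS). The sha256 digest narrows the search; only a bytewise
    comparison confirms a duplicate, so a hash collision cannot cause
    two different blocks to be silently merged."""

    def __init__(self):
        # digest -> list of distinct blocks sharing that digest
        # (the list has length > 1 only on a real collision)
        self.blocks = {}

    def store(self, data: bytes) -> tuple[bytes, int]:
        """Store a block, deduplicating against bitwise-identical copies.
        Returns (digest, index) identifying the stored block."""
        digest = hashlib.sha256(data).digest()
        candidates = self.blocks.setdefault(digest, [])
        for i, existing in enumerate(candidates):
            if existing == data:        # verify: never trust the hash alone
                return digest, i        # true duplicate; reference it
        candidates.append(data)         # new block (or a hash collision)
        return digest, len(candidates) - 1
```

With verification in place, a collision merely costs one extra comparison and a second slot under the same digest; without it, a collision would corrupt data by aliasing two different blocks.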