On Sun, 2007-06-24 at 16:58 -0700, dave johnson wrote:
> The most common non-proprietary hash calc for file-level deduplication seems
> to be the combination of SHA1 and MD5 together. Collisions have been
> shown to exist in MD5 and theorized to exist in SHA1 by extrapolation, but
> the probability of collisions occurring simultaneously in both is as "small"
> as the capacity of ZFS is "large" :)

No. Collisions in *any* hash function with output smaller than its input
are known to exist by a simple counting (pigeonhole) argument. The tricky
part is finding the collisions without needing to resort to brute-force
search.

Last I checked, the cryptographers specializing in hash functions are
much less optimistic than this.

I wouldn't de-duplicate without verifying that the two blocks are
actually bitwise identical.

>
> While computationally intense, this would be a VERY welcome feature addition
> to ZFS and given the existing infrastructure within the filesystem already,
> while non-trivial by any means, it seems a prime candidate. I am not a
> programmer so I do not have the expertise to spearhead such a movement but I
> would think getting at least a placeholder "Goals and Objectives" page into
> the OZFS community pages would be a good start even if movement on this
> doesn't come for a year or more.
>
> Thoughts ?
>
> -=dave
>
> ----- Original Message -----
> From: "Gary Mills" <[EMAIL PROTECTED]>
> To: "Erik Trimble" <[EMAIL PROTECTED]>
> Cc: "Matthew Ahrens" <[EMAIL PROTECTED]>; "roland" <[EMAIL PROTECTED]>;
> <zfs-discuss@opensolaris.org>
> Sent: Sunday, June 24, 2007 3:58 PM
> Subject: Re: [zfs-discuss] zfs space efficiency
>
>
> > On Sun, Jun 24, 2007 at 03:39:40PM -0700, Erik Trimble wrote:
> >> Matthew Ahrens wrote:
> >> >Will Murnane wrote:
> >> >>On 6/23/07, Erik Trimble <[EMAIL PROTECTED]> wrote:
> >> >>>Now, wouldn't it be nice to have syscalls which would implement "cp" and
> >> >>>"mv", thus abstracting it away from the userland app?
> >
> >> >A "copyfile" primitive would be great! It would solve the problem of
> >> >having all those "friends" to deal with -- stat(), extended
> >> >attributes, UFS ACLs, NFSv4 ACLs, CIFS attributes, etc. That isn't to
> >> >say that it would have to be implemented in the kernel; it could
> >> >easily be a library function.
> >> >
> >> I'm with Matt.
> >> Having a "copyfile" library/sys call would be of
> >> significant advantage. In this case, we can't currently take advantage
> >> of the CoW ability of ZFS when doing 'cp A B' (as has been pointed out
> >> to me). 'cp' simply opens file A with read(), opens a new file B with
> >> write(), and then shuffles the data between the two. Now, if we had a
> >> copyfile(A,B) primitive, then the 'cp' binary would simply call this
> >> function, and, depending on the underlying FS, it would get implemented
> >> differently. In UFS, it would work as it does now. For ZFS, it would
> >> work like a snapshot, where file A and B share data blocks (at least
> >> until someone starts to update either A or B).
> >
> > Isn't this technique an instance of `deduplication', which seems to be
> > a hot idea in storage these days? I wonder if it could be done
> > automatically, behind the scenes, in some fashion.
> >
> > --
> > -Gary Mills- -Unix Support- -U of M Academic Computing and
> > Networking-
> > _______________________________________________
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
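
To make the "verify before sharing" point from the reply at the top of
this message concrete, here is a rough sketch in Python. It is not ZFS
code and not a proposal for how ZFS should do it; the class name
VerifyingDedupStore, its methods, and the choice of SHA-256 as the
default digest are all my own assumptions for illustration. The idea is
only that the hash is used to find *candidate* duplicates, and a block is
shared only after a byte-for-byte comparison succeeds.

```python
import hashlib


class VerifyingDedupStore:
    """Toy in-memory block store (hypothetical, for illustration only).

    The digest narrows the search to candidate duplicates; a block is
    shared only if a full byte comparison confirms it is identical.
    """

    def __init__(self, digest=None):
        # Assumption: SHA-256 by default; any digest function can be passed in.
        self._digest = digest or (lambda b: hashlib.sha256(b).digest())
        # digest -> list of distinct blocks that happen to share that digest
        self._buckets = {}
        self.blocks_written = 0  # unique blocks actually stored
        self.blocks_shared = 0   # writes satisfied by an existing block

    def put(self, block):
        """Store a block and return a (digest, index) reference to it."""
        key = self._digest(block)
        bucket = self._buckets.setdefault(key, [])
        for i, existing in enumerate(bucket):
            if existing == block:  # the bitwise verification step
                self.blocks_shared += 1
                return (key, i)
        # Either a new digest, or a collision: same digest, different bytes.
        bucket.append(bytes(block))
        self.blocks_written += 1
        return (key, len(bucket) - 1)

    def get(self, ref):
        """Return the bytes that a reference from put() points to."""
        key, i = ref
        return self._buckets[key][i]


if __name__ == "__main__":
    # A deliberately weak digest (first byte only) forces collisions so
    # the verification path is exercised.
    store = VerifyingDedupStore(digest=lambda b: b[:1])
    a = store.put(b"aaaa")
    b = store.put(b"abbb")   # same digest as a, different contents
    c = store.put(b"aaaa")   # true duplicate of a
    print(a == c, a != b, store.blocks_written, store.blocks_shared)
```

With the weak digest, the two colliding blocks are kept as separate
entries and only the true duplicate is shared. Without the comparison in
put(), the second block would silently be replaced by a reference to the
first, which is exactly the failure mode being argued about.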