On Sun, 2007-06-24 at 16:58 -0700, dave johnson wrote:
> The most common non-proprietary hash calc for file-level deduplication seems 
> to be the combination of SHA1 and MD5 together.  Collisions have been 
> shown to exist in MD5 and theorized to exist in SHA1 by extrapolation, but 
> the probability of collisions occurring simultaneously in both is to "small" 
> as the capacity of ZFS is to "large" :)

No.  Collisions in *any* hash function with output smaller than input
are guaranteed to exist by a simple counting (pigeonhole) argument.  The
tricky part is finding the collisions without needing to resort to brute
force search.
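
To make the counting argument concrete, here is a small sketch (mine,
nothing to do with ZFS internals).  It assumes a made-up 8-bit "hash",
the first byte of SHA-256, so there are only 256 possible outputs.
Hashing 257 distinct inputs must then produce at least one collision.
The names tiny_hash and first_collision are hypothetical.

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    """Hypothetical 8-bit hash: the first byte of the SHA-256 digest."""
    return hashlib.sha256(data).digest()[0]

def first_collision(count: int):
    """Hash `count` distinct inputs; return the first colliding pair, or None."""
    seen = {}  # hash value -> first input that produced it
    for i in range(count):
        msg = str(i).encode()
        h = tiny_hash(msg)
        if h in seen:
            return seen[h], msg, h
        seen[h] = msg
    return None

# 257 distinct inputs into 256 possible outputs: a collision is unavoidable.
pair = first_collision(257)
```

With a real 160-bit or 256-bit output the same argument still holds; the
only thing that changes is how hard a colliding pair is to find.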

Last I checked, the cryptographers specializing in hash functions are
much less optimistic than this.  

I wouldn't de-duplicate without verifying that the two blocks are
actually bitwise identical.
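
Roughly what I have in mind, as a toy in-memory sketch rather than a
proposal for how ZFS should do it: use the hash only to find candidate
duplicates, then compare the bytes before sharing a block.  The
DedupStore class, its put() method, and the hash_fn parameter are all
invented for this illustration.

```python
import hashlib

class DedupStore:
    """Toy block store (hypothetical): hash to find candidates, bytes to decide."""

    def __init__(self, hash_fn=lambda b: hashlib.sha256(b).digest()):
        self.hash_fn = hash_fn
        self.by_digest = {}  # digest -> list of distinct blocks sharing that digest

    def put(self, block: bytes):
        """Store a block; return ((digest, index), was_duplicate)."""
        digest = self.hash_fn(block)
        candidates = self.by_digest.setdefault(digest, [])
        for idx, existing in enumerate(candidates):
            if existing == block:  # bitwise comparison, not just a hash match
                return (digest, idx), True
        # Same digest but different bytes (or a new digest): keep a separate copy.
        candidates.append(block)
        return (digest, len(candidates) - 1), False

store = DedupStore()
ref_a, dup_a = store.put(b"a" * 4096)
ref_b, dup_b = store.put(b"a" * 4096)   # identical bytes: shared
ref_c, dup_c = store.put(b"c" * 4096)   # different bytes: stored separately
```

The extra read and compare costs something on every hash match, but a
match should be rare unless the data really is duplicated, and a false
match silently corrupts data.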

> 
> While computationally intense, this would be a VERY welcome feature addition 
> to ZFS and given the existing infrastructure within the filesystem already, 
> while non-trivial by any means, it seems a prime candidate.  I am not a 
> programmer so I do not have the expertise to spearhead such a movement but I 
> would think getting at least a placeholder "Goals and Objectives" page into 
> the OZFS community pages would be a good start even if movement on this 
> doesn't come for a year or more.
> 
> Thoughts ?
> 
> -=dave
> 
> ----- Original Message ----- 
> From: "Gary Mills" <[EMAIL PROTECTED]>
> To: "Erik Trimble" <[EMAIL PROTECTED]>
> Cc: "Matthew Ahrens" <[EMAIL PROTECTED]>; "roland" <[EMAIL PROTECTED]>; 
> <zfs-discuss@opensolaris.org>
> Sent: Sunday, June 24, 2007 3:58 PM
> Subject: Re: [zfs-discuss] zfs space efficiency
> 
> 
> > On Sun, Jun 24, 2007 at 03:39:40PM -0700, Erik Trimble wrote:
> >> Matthew Ahrens wrote:
> >> >Will Murnane wrote:
> >> >>On 6/23/07, Erik Trimble <[EMAIL PROTECTED]> wrote:
> >> >>>Now, wouldn't it be nice to have syscalls which would implement "cp"
> >> >>>and
> >> >>>"mv", thus abstracting it away from the userland app?
> >
> >> >A "copyfile" primitive would be great!  It would solve the problem of
> >> >having all those "friends" to deal with -- stat(), extended
> >> >attributes, UFS ACLs, NFSv4 ACLs, CIFS attributes, etc.  That isn't to
> >> >say that it would have to be implemented in the kernel; it could
> >> >easily be a library function.
> >> >
> >> I'm with Matt.  Having a "copyfile" library/sys call would be of
> >> significant advantage.  In this case, we can't currently take advantage
> >> of the CoW ability of ZFS when doing 'cp A B'  (as has been pointed out
> >> to me).  'cp' simply opens file A for reading, opens a new file B for
> >> writing, and then shuffles the data between the two with read() and
> >> write().  Now, if we had a
> >> copyfile(A,B) primitive, then the 'cp' binary would simply call this
> >> function, and, depending on the underlying FS, it would get implemented
> >> differently.  In UFS, it would work as it does now. For ZFS, it would
> >> work like a snapshot, where file A and B share data blocks (at least
> >> until someone starts to update either A or B).
> >
> > Isn't this technique an instance of `deduplication', which seems to be
> > a hot idea in storage these days?  I wonder if it could be done
> > automatically, behind the scenes, in some fashion.
> >
> > -- 
> > -Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
> > _______________________________________________
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> > 
> 
