"Dick Davies" <[EMAIL PROTECTED]> wrote on 01/10/2007 05:26:45 AM:
> On 08/01/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> > I think that in addition to lzjb compression, squishing blocks that
> > contain the same data would buy a lot of space for administrators
> > working in many common workflows.
>
> This idea has occurred to me too - I think there are definite
> advantages to 'block re-use'. When you start talking about multiple
> similar zones, I suspect substantial space savings could be made -
> and if you can re-use that saved storage to provide additional
> redundancy, everyone would be happy.
Very true. Even on normal fileserver usage I have historically found
15-30% file-level duplication; added to cheap snapshotting and the
compression that already exists, I think this is a big, big win.
>
> > Assumptions:
> >
> > SHA256 hash is used (Fletcher2/4 have too many collisions; SHA256
> > gives about 2^128 collision resistance, if I remember correctly).
> > The SHA256 hash is taken on the data portion of the block as it
> > exists on disk; the metadata structure is hashed separately.
> > In the current metadata structure, there is a reserved bit portion
> > to be used in the future.
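For concreteness, that separation might look like the following - a
minimal Python sketch, where block_data and metadata are assumed to be
the raw bytes of the two portions:

    import hashlib

    def block_checksums(block_data, metadata):
        """Hash the data portion of the block as it exists on disk, and
        hash the metadata structure separately; only the data digest
        would feed the dedup index."""
        data_digest = hashlib.sha256(block_data).digest()
        meta_digest = hashlib.sha256(metadata).digest()
        return data_digest, meta_digest

Only data_digest would be looked up in the global checksum tree.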
> >
> >
> > Description of change:
> > Creates:
> > The filesystem goes through its normal process of writing a block
> > and creating the checksum.
> > Before the step where the metadata tree is pushed, the checksum is
> > checked against a global checksum tree to see if there is any match.
> > If a match exists, insert a metadata placeholder for the block that
> > references the already existing block on disk, and increment a
> > number_of_links count on the metadata block to keep track of the
> > pointers pointing to this block. Then free up the new block that
> > was written and checksummed, to be used in the future.
> > Else, if there is no match, update the checksum tree with the new
> > checksum and continue as normal.
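In rough code, that write path might look like the sketch below - a
minimal Python illustration, not actual ZFS internals; checksum_tree,
allocate_block, and free_block are all hypothetical stand-ins for the
real structures:

    import hashlib

    # Hypothetical global index: SHA256 digest -> {"blkptr": ..., "refs": n}
    checksum_tree = {}

    def write_with_dedup(block_data, allocate_block, free_block):
        """Sketch of the proposed dedup write path.  allocate_block and
        free_block stand in for the real allocator; both are assumptions,
        not actual ZFS interfaces."""
        digest = hashlib.sha256(block_data).digest()

        # The block is written and checksummed as normal first.
        blkptr = allocate_block(block_data)

        entry = checksum_tree.get(digest)
        if entry is not None:
            # Match: reference the existing block, bump its link count,
            # and free the block we just wrote for future use.
            entry["refs"] += 1
            free_block(blkptr)
            return entry["blkptr"]

        # No match: record the new checksum and continue as normal.
        checksum_tree[digest] = {"blkptr": blkptr, "refs": 1}
        return blkptr

Frees would need the mirror-image logic: decrement refs on the entry
and only release the physical block once the count drops to zero.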
>
> Unless I'm reading this wrong, this sounds a lot like Plan 9's 'Venti'
> architecture ( http://cm.bell-labs.com/sys/doc/venti.html ).
>
> But using a hash 'label' seems the wrong approach.
> ZFS is supposed to scale to terrifying levels, and the chance of a
> collision, however small, works against that. I wouldn't want to
> trade reliability for some extra space.
>
That issue has already come up in the thread: SHA256 gives about 2^128
resistance against random (birthday) collisions and on the order of
2^256 against targeted preimage attacks. That is pretty darn good, but
it would also make sense to perform an rsync-like secondary check on a
match, using a dissimilar cryptographic hash. If we hit the vanishingly
unlikely case where two distinct blocks match both SHA256 and whatever
the secondary hash is, I think that block should be lost (act of god). =)
Even with this dual-check approach, the index (and the only hash stored)
can still be just the SHA256, since the chance of a collision is
effectively nil in this context.
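A minimal sketch of that dual check, assuming SHA512 as the (purely
illustrative) dissimilar secondary hash - on a SHA256 index hit, the
second hash is computed on demand and never stored:

    import hashlib

    def secondary_hash(block_data):
        # SHA512 is only an illustrative choice of dissimilar hash.
        return hashlib.sha512(block_data).digest()

    def verify_match(new_block, existing_block):
        """Called only after the SHA256 index reports a hit.  Returns
        True if the blocks can safely be shared."""
        if secondary_hash(new_block) == secondary_hash(existing_block):
            return True
        # A mismatch here means the SHA256s collided on different data,
        # so the new block must be stored separately after all; both
        # hashes agreeing on differing data is the "act of god" case.
        return False

Either way the existing block has to be read back on a hit, so comparing
the raw bytes directly would be an equally valid, collision-proof variant.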
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss