[EMAIL PROTECTED] wrote on 07/22/2008 09:58:53 AM:

> To do dedup properly, it seems like there would have to be some
> overly complicated methodology for a sort of delayed dedup of the
> data. For speed, you'd want your writes to go straight into the
> cache and get flushed out as quickly as possible, keeping everything
> as ACID as possible. Then, a dedup scrubber would take what was
> written, do the voodoo magic of checksumming the new data, scanning
> the tree to see if there are any matches, locking the duplicates,
> running the usage counters up or down for that block of data,
> swapping out inodes, and marking the duplicate data as free space.
I agree, but what you are describing is file-based dedup. ZFS already has
the groundwork for dedup in the system (block-level checksumming and
pointers).
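
To make the scrubber idea concrete, here is a toy sketch in Python (all
names are invented for illustration; this is not how the actual ZFS code
is or would be structured). It walks existing block pointers, indexes
blocks by checksum, and remaps duplicates to one canonical copy:

import hashlib

class BlockStore:
    """Toy block store: addr -> data, plus per-block reference counts."""
    def __init__(self):
        self.blocks = {}   # addr -> data
        self.refcnt = {}   # addr -> reference count
        self.next_addr = 0

    def write(self, data):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        self.refcnt[addr] = 1
        return addr

def dedup_scrub(store, files):
    """files: name -> list of block addrs. Index every block by its
    checksum and remap duplicates to the first copy seen."""
    seen = {}  # checksum -> canonical addr
    for addrs in files.values():
        for i, addr in enumerate(addrs):
            csum = hashlib.sha256(store.blocks[addr]).digest()
            canon = seen.setdefault(csum, addr)
            if canon != addr:
                addrs[i] = canon             # swap the block pointer
                store.refcnt[canon] += 1     # run the usage counter up
                store.refcnt[addr] -= 1
                if store.refcnt[addr] == 0:  # duplicate becomes free space
                    del store.blocks[addr]

store = BlockStore()
a = [store.write(b"hdr"), store.write(b"body-a")]
b = [store.write(b"hdr"), store.write(b"body-b")]
dedup_scrub(store, {"a": a, "b": b})
assert b[0] == a[0]   # the duplicate header block was remapped and freed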

> It's a
> lofty goal, but one that is doable. I guess this is only necessary
> if deduplication is done at the file level. If done at the block
> level, it could possibly be done on the fly, what with the already
> implemented checksumming at the block level,

Exactly -- that is why it is attractive for ZFS; so much of the groundwork
is already done and needed for the fs/pool anyway.
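
For the on-the-fly case, the write path already computes a checksum for
every block, so the extra work is conceptually one table lookup before
allocating. Another toy sketch, again with invented names (a real
implementation would also have to verify or use a collision-resistant
checksum, handle persistence, locking, and so on):

import hashlib

blocks = []        # toy "pool": list index is the block address
dedup_table = {}   # checksum -> (addr, refcount); hypothetical structure

def write_block(data):
    # The checksum is computed for integrity anyway; reuse it for dedup.
    csum = hashlib.sha256(data).digest()
    if csum in dedup_table:
        addr, refs = dedup_table[csum]
        dedup_table[csum] = (addr, refs + 1)  # no new allocation, bump refcount
        return addr                           # block pointer to existing data
    blocks.append(data)
    addr = len(blocks) - 1
    dedup_table[csum] = (addr, 1)
    return addr

# identical payloads land on the same block address:
x = write_block(b"hello world")
y = write_block(b"hello world")
assert x == y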

> but then your reads
> will suffer because pieces of files can potentially be spread all
> over hell and half of Georgia on the vdevs.

I don't know that you can make this statement without some study of an
actual implementation on real-world data -- and then, because it is block
based, you should see varying degrees of this dedup-induced fragmentation
depending on data and usage.

For instance, I would imagine that in many scenarios much of the deduped
data would belong to the same or very similar files. In that case the
blocks were laid out as well as they could be on the first write, so the
deduped blocks would point to a fairly sequential run of blocks. Some
files may share duplicate headers or similar portions of data -- these
could cause you to jump around the disk; but I do not know how often that
would be hit, or how much it would impact real-world usage.
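
A toy model of that intuition (pure illustration in Python, not a claim
about any real implementation; addresses and layouts are made up):

def seek_count(addrs):
    """Count non-sequential jumps a reader makes over a file's blocks."""
    return sum(1 for a, b in zip(addrs, addrs[1:]) if b != a + 1)

# Case 1: two near-identical files. The second file's pointers reuse the
# first file's sequential run of blocks, so reads stay almost sequential.
original = list(range(0, 100))           # blocks 0..99, laid out in order
near_copy = list(range(0, 99)) + [500]   # one divergent tail block
print(seek_count(near_copy))             # -> 1 extra seek

# Case 2: only a common header block is shared. Each file makes one jump
# from the shared header back to its own sequential data.
shared_header = [0]
file_a = shared_header + list(range(100, 200))
file_b = shared_header + list(range(300, 400))
print(seek_count(file_a), seek_count(file_b))  # -> 1 1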


> Deduplication is going
> to require the judicious application of hallucinogens and man hours.
> I expect that someone is up to the task.

I would prefer the coder(s) not be seeing "pink elephants" while writing
this, but yes, it can and will be done. It (I believe) will be easier
after the grow/shrink/evac code paths are in place, though. Also, the
grow/shrink/evac path, if it is done right, allows for other cool things,
like serving as the base for a roaming defrag that takes into account
snapshots, clones, live data, and the like. I know that some feel that
the grow/shrink/evac code is more important for home users, but I think
it is super important for most of these additional features.
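
To hand-wave at why evac matters here: whatever relocates a block has to
find and update every pointer that references it -- live filesystem,
snapshots, and clones alike. A toy sketch (invented names; note that in
real ZFS, snapshot block pointers are immutable on disk, which is exactly
what makes this hard -- the toy glosses over that):

def relocate(block_map, referrers, old_addr, new_addr):
    """Toy block evacuation: move a block, then rewrite every pointer.
    referrers: dataset name -> list of block addrs; live fs, snapshots,
    and clones all count, and a defragger must honor all of them."""
    block_map[new_addr] = block_map.pop(old_addr)
    for addrs in referrers.values():
        for i, a in enumerate(addrs):
            if a == old_addr:
                addrs[i] = new_addr

blocks = {7: b"shared data"}
refs = {
    "tank/fs":       [7, 12, 13],
    "tank/fs@snap1": [7, 12],
    "tank/clone":    [7, 40],
}
relocate(blocks, refs, old_addr=7, new_addr=100)
assert all(100 in a for a in refs.values())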

-Wade

> On Tue, Jul 22, 2008 at 10:39 AM, <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote on 07/22/2008 08:05:01 AM:
>
> > > Hi All
> > > Is there any hope for deduplication on ZFS?
> > > Mertol Ozyoney
> > > Storage Practice - Sales Manager
> > > Sun Microsystems
> > > Email [EMAIL PROTECTED]
> >
> > There is always hope.
> >
> > Seriously though, looking at
> > http://en.wikipedia.org/wiki/Comparison_of_revision_control_software
> > there are a lot of choices of how we could implement this.
> >
> > SVN/K, Mercurial and Sun Teamware all come to mind. Simply ;) merge
> > one of those with ZFS.
> >
> > It _could_ be as simple (with SVN as an example) as using directory
> > listings to produce files which were then 'diffed'. You could then
> > view the diffs as though they were changes made to lines of source
> > code.
> >
> > Just add a "tree" subroutine to allow you to grab all the diffs that
> > referenced changes to file 'xyz' and you would have easy access to
> > all the changes of a particular file (or directory).
> >
> > With the speed-optimized ability added to use ZFS snapshots with the
> > "tree subroutine" to roll back a single file (or directory), you
> > could undo / redo your way through the filesystem.
> >
>

> Dedup is not revision control; you seem to completely misunderstand the
> problem.
>
>
>
> > Using an LKCD
> > (http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html)
> > you could "sit out" on the play and watch from the sidelines --
> > returning to the OS when you thought you were 'safe' (and if not,
> > jumping back out).
> >

> Now it seems you have veered even further off course. What are you
> implying LKCD has to do with ZFS, Solaris, or dedup, let alone revision
> control software?
>
> -Wade
>
> --
> chris -at- microcozm -dot- net
> === Si Hoc Legere Scis Nimium Eruditionis Habes

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
