[EMAIL PROTECTED] wrote on 07/22/2008 11:48:30 AM:

> Chris Cosby wrote:
> >
> > On Tue, Jul 22, 2008 at 11:19 AM, <[EMAIL PROTECTED]> wrote:
> >
> >     [EMAIL PROTECTED] wrote on 07/22/2008 09:58:53 AM:
> >
> > > To do dedup properly, it seems like there would have to be some overly complicated methodology for a sort of delayed dedup of the data. For speed, you'd want your writes to go straight into the cache and get flushed out as quickly as possible, keeping everything as ACID as possible. Then, a dedup scrubber would take what was written, do the voodoo magic of checksumming the new data, scanning the tree to see if there are any matches, locking the duplicates, running the usage counters up or down for that block of data, swapping out inodes, and marking the duplicate data as free space.
> >
> > I agree, but what you are describing is file-based dedup. ZFS already has the groundwork for dedup in the system (block-level checksumming and pointers).
> >
> > > It's a lofty goal, but one that is doable. I guess this is only necessary if deduplication is done at the file level. If done at the block level, it could possibly be done on the fly, what with the already implemented checksumming at the block level,
> >
> > Exactly -- that is why it is attractive for ZFS: so much of the groundwork is done and needed for the fs/pool already.
> >
> > > but then your reads will suffer because pieces of files can potentially be spread all over hell and half of Georgia on the zdevs.
> >
> > I don't know that you can make this statement without some study of an actual implementation on real-world data -- and then, because it is block based, you should see varying degrees of this dedup-flack-frag depending on data/usage.
> >
> > It's just a NonScientificWAG. I agree that most of the duplicated blocks will in most cases be part of identical files anyway, and thus lined up exactly as you'd want them. I was just free thinking and typing.
>
> No, you are right to be concerned over block-level dedup seriously impacting seeks. The problem is that, given many common storage scenarios, you will have not just similar files, but multiple common sections of many files. Things such as the various standard productivity app documents will not just have the same header sections; internally, there will be significant duplications of considerable length with other documents from the same application. Your 5MB Word file is thus likely to share several (actually, many) multi-kB segments with other Word files. You will thus end up seeking all over the disk to read _most_ Word files. Which really sucks. I can list at least a couple more common scenarios where dedup has the potential to save at least some reasonable amount of space, yet will absolutely kill performance.
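For reference before replying point by point: the bookkeeping described in the first quoted message above boils down to a checksum-keyed table with reference counts. Here is a toy sketch in Python -- purely illustrative, with made-up names, and nothing to do with the actual ZFS code -- where a write whose checksum already exists just runs the counter up and reuses the existing block, and freeing runs it down until the block can be reclaimed:

    import hashlib

    class DedupTable:
        """Toy block-level dedup bookkeeping: checksum -> block, with refcounts.
        Purely illustrative; not how ZFS stores or names anything."""

        def __init__(self):
            self.by_hash = {}    # checksum -> block id
            self.refcount = {}   # block id -> number of references
            self.blocks = {}     # block id -> data (stand-in for the on-disk block)
            self.next_id = 0

        def write(self, data):
            """Return a block id, reusing an existing block when the checksum matches."""
            h = hashlib.sha256(data).hexdigest()
            if h in self.by_hash:            # duplicate: run the usage counter up
                bid = self.by_hash[h]
                self.refcount[bid] += 1
                return bid
            bid = self.next_id               # new data: allocate a fresh block
            self.next_id += 1
            self.by_hash[h] = bid
            self.refcount[bid] = 1
            self.blocks[bid] = data
            return bid

        def free(self, bid):
            """Run the usage counter down; reclaim the block when it hits zero."""
            self.refcount[bid] -= 1
            if self.refcount[bid] == 0:
                data = self.blocks.pop(bid)
                del self.by_hash[hashlib.sha256(data).hexdigest()]
                del self.refcount[bid]

    t = DedupTable()
    first = t.write(b"identical payload")
    second = t.write(b"identical payload")    # second write dedups onto the first block
    print(first == second, t.refcount[first]) # True 2

The interesting engineering is everything this sketch ignores: where the table lives, how it is cached, and how it interacts with the write path.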
While you may have a point on some data sets, actual testing of this type of data (28,000+ actual end-user doc files) using xdelta with 4k and 8k block sizes shows that the similar blocks in these files are in the 2% range (~6% for 4k). That means a full read of each file on average would require < 6% seeks to other disk areas. That is not bad, and it is the worst-case picture, as those duplicate blocks would need to live at the same offsets and have the same block boundaries to "match" under the proposed algo. To me this means Word docs are not a good candidate for dedup at the block level -- but the actual cost to dedup them anyway seems small. Of course you could come up with data that is pathologically bad for these benchmarks, but I do not believe it would be nearly as bad as you are making it out to be on real-world data.

> > For instance, I would imagine that in many scenarios much of the dedup data blocks would belong to the same or very similar files. In this case the blocks were written as best they could on the first write, so the deduped blocks would point to a pretty sequential line of blocks. Now on some files there may be duplicate headers or similar portions of data -- these may cause you to jump around the disk, but I do not know how much this would be hit or impact real-world usage.
> >
> > > Deduplication is going to require the judicious application of hallucinogens and man hours. I expect that someone is up to the task.
> >
> > I would prefer the coder(s) not be seeing "pink elephants" while writing this, but yes, it can and will be done. It (I believe) will be easier after the grow/shrink/evac code paths are in place, though. Also, the grow/shrink/evac path allows (if it is done right) for other cool things, like a base on which to build a roaming defrag that takes into account snaps, clones, live data and the like. I know that some feel that the grow/shrink/evac code is more important for home users, but I think that it is super important for most of these additional features.
> >
> > The elephants are just there to keep the coders company. There are tons of benefits for dedup, both for home and non-home users. I'm happy that it's going to be done. I expect the first complaints will come from those people who don't understand it, when their df and du numbers look different than their zpool status ones. Perhaps df/du will just have to be faked out for those folks, or we just apply the same hallucinogens to them instead.
>
> I'm still not convinced that dedup is really worth it for anything but very limited, constrained usage. Disk is just so cheap that you _really_ have to have an enormous amount of dup before the performance penalties of dedup are countered.

If you can dedup 30% of your data, your disk just became 30% cheaper. Depending on the workflow, the cost of disk is the barrier -- not CPU cycles or write/read speed.

> This in many ways reminds me of last year's discussion over file versioning in the filesystem. It sounds like a cool idea, but it's not a generally good idea. I tend to think that this kind of problem is better served by applications handling it, if they are concerned about it.

Snapping a full filesystem for versions is expensive when you are dealing with one file changing. Doing dedup in ZFS is inexpensive versus a follow-the-writes queue.
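One more note on the xdelta numbers near the top of this reply: a delta tool matches runs at any offset, while fixed-offset block dedup only counts blocks that line up exactly, so even a one-byte shift defeats it. A toy demonstration of that boundary effect (hypothetical 4k block size, md5 as a stand-in checksum):

    import hashlib
    import os

    BLOCK = 4096  # hypothetical dedup block size

    def block_hashes(data, block=BLOCK):
        # Hash fixed-offset blocks the way a block-level dedup would see them.
        return {hashlib.md5(data[i:i + block]).hexdigest()
                for i in range(0, len(data), block)}

    original = os.urandom(64 * 1024)   # 64 KiB of stand-in file content
    shifted = b"x" + original          # same content, offset by a single byte

    common = block_hashes(original) & block_hashes(shifted)
    print("fixed-offset blocks in common:", len(common))   # almost certainly 0

That is why the 2-6% xdelta figures are an upper bound on what a same-boundary scheme would actually find.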
> Pretty much, here's what I've heard:
>
> Dedup Advantages:
>
> (1) Save space relative to the amount of duplication. This is highly dependent on workload, and ranges from 0% to 99%, but the distribution of possibilities isn't a bell curve (i.e. the average space saved isn't 50%).
>
> Dedup Disadvantages:
>
> (1) Increased codebase complexity, in both cases of dedup during write and ex-post-facto batched dedup.

Yes, but the code path is optional.

> (2) Noticeable write performance penalty (assuming block-level dedup on write), with potential write cache issues.

There is a cost, but smart use of hash lookups and caching should absorb most of it. Most of the cost comes from using a better hashing algo instead of fletcher2/4.

> (3) Very significant post-write dedup time, at least on the order of 'zfs scrub'. Also, during such a post-write scenario, it more or less takes the zpool out of usage.

Post-write, while not as bad as a separate dedup app, reduces the value of tying it to ZFS. It should be done inline.

> (4) If dedup is done at the block level, not at the file level, it kills read performance, effectively turning all dedup'd files from a sequential read into a random read. That is, block-level dedup drastically accelerates filesystem fragmentation.

Again, this is completely dependent on the implementation and the data sets. Looking at our real-world data on a 14TB user file store shows that most of the dedup that would happen (using 4, 8, 16 and 128k blocks) happens on files that are identical at the binary level; a small percentage of dedup happens on other data if a static block seek is used (no sliding delta window).

> (5) Something no one has talked about, but is of concern. By removing duplication, you increase the likelihood that loss of the "master" segment will corrupt many more files. Yes, ZFS has self-healing and such. But, particularly in the case where there is no ZFS pool redundancy (or pool-level redundancy has been compromised), loss of one block can thus be many times more severe.

I assume that no one has talked about that because it seems obvious. Your blocks become N times more "valuable", where N is the number of block pointers that reference that block for dedup. A lost block on ZFS can therefore affect N files + X snapshots + Y clones, or the entire filesystem if it was holding one of a few key ZFS structures.

> We need to think long and hard about what the real widespread benefits of dedup are before committing to a filesystem-level solution, rather than an application-level one. In particular, we need some real-world data on the actual level of duplication under a wide variety of circumstances.

There was already a post that shows how to exploit the ZFS block checksums to gather similar-block stats. An issue I have with that is that the ZFS default hashing is pretty collision-prone, so the data seems suspect. I can probably post the Perl scripts I used to gather data on my systems. The hash lookup tables that they generate are pretty damn huge, but the reporting part could display the relative info in a compact way for posting. Assumptions I made were fixed block seeks (slurping in the largest block of data on each read and acting on it as all block sizes in the test phase, to be efficient), and md5 match = bin match (pretty safe, but a real system would do a bit-level compare on a hash match).
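For anyone who wants to gather the same kind of numbers, here is a rough re-sketch in Python of that sort of fixed-block survey -- not the Perl scripts mentioned above, and the 8k block size is just one arbitrary pick -- using the same md5-match-equals-binary-match shortcut:

    #!/usr/bin/env python
    """Rough fixed-block duplicate survey (illustrative only; not the scripts
    mentioned above). A hash match is treated as a binary match, as described."""
    import hashlib
    import os
    import sys
    from collections import Counter

    BLOCK = 8192   # arbitrary block size to test; 4k/16k/128k are equally valid picks

    def survey(root, block=BLOCK):
        counts = Counter()   # block md5 -> number of times that block was seen
        total = 0
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        while True:
                            chunk = f.read(block)
                            if not chunk:
                                break
                            counts[hashlib.md5(chunk).hexdigest()] += 1
                            total += 1
                except OSError:
                    continue   # unreadable file: skip it
        duplicates = total - len(counts)   # blocks that could collapse onto an existing copy
        return total, duplicates

    if __name__ == "__main__":
        total, duplicates = survey(sys.argv[1] if len(sys.argv) > 1 else ".")
        pct = 100.0 * duplicates / total if total else 0.0
        print("%d blocks scanned, %d duplicates (%.1f%% potentially reclaimable)"
              % (total, duplicates, pct))

Point it at a directory tree and it prints how many fixed-offset blocks it read and what fraction already existed elsewhere in the tree.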
-Wade

> --
> Erik Trimble
> Java System Support
> Mailstop: usca22-123
> Phone: x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss