2011-07-09 20:04, Edward Ned Harvey wrote:
--- Performance gain:
Unfortunately there was only one area that I found any performance
gain. When you read back duplicate data that was previously written
with dedup, then you get a lot more cache hits, and as a result, the
reads go faster. Unfortunately these gains are diminished... I don't
know by what... But you only have about 2x to 4x performance gain
reading previously dedup'd data, as compared to reading the same data
which was never dedup'd. Even when repeatedly reading the same file
which is 100% duplicate data (created by dd from /dev/zero) so all the
data is 100% in cache... I still see only 2x to 4x performance gain
with dedup.
First of all, thanks for all the experimental research and results,
even if the outlook is grim. I'd love to see comments from people
whose systems use dedup and actually benefit from it: how much
they gain (e.g. VM farms), and what differs in their setups
(e.g. at least 256 GB of RAM, or whatever it takes).
Hopefully the discrepancy between the blissful hopes I had - that
dedup would save disk space and give the system a boost, somewhat
like online compression can - and the cruel reality will result
in some improvement project. Perhaps that would be an offline
dedup implementation (maybe with an online-dedup option that can
be turned off), as recently discussed on the list.
Deleting stuff is still a pain, though. For the past week my box
has been trying to delete an rsynced backup of a Linux machine -
some 300k files adding up to about 50 GB. Deleting large files
was fairly quick, but files consuming just a few blocks are
really slow, even when I batch background rm's (roughly like the
sketch below) so that a hundred processes hang and then all
complete at once a minute or two later.
And quite often the iSCSI initiator or target goes crazy, so one
of the boxes (or both) has to be rebooted, about thrice a day.
I described my setup before, won't clobber it into here ;)
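For what it's worth, the batching looks roughly like this sketch
(the backup path is just a placeholder, and the batch size of 100
is something one would tune):

#!/bin/sh
# Rough sketch: spawn rm's in the background and wait after every
# 100, so there are never more than ~100 deletes in flight at once.
# /backup/linuxbox is a placeholder path.
find /backup/linuxbox -type f -print | {
    i=0
    while IFS= read -r f; do
        rm "$f" &
        i=$((i + 1))
        [ $((i % 100)) -eq 0 ] && wait
    done
    wait    # wait for the last partial batch
}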
Regarding the low read-performance gain: you suggested in a later
post that this could be due to the difference between RAM and
disk bandwidth on your machine. I for one think that, without
sufficient ARC block-caching, dedup reads would also suffer
greatly from fragmentation - any large file containing some or
all deduped data is practically guaranteed to have its blocks
scattered across all of your storage, at least if the file was
committed to the deduped pool late in the pool's life, when most
or all of its blocks were already there.
By the way, did you estimate how much overhead dedup adds in
terms of metadata blocks? For example, it has often been said on
the list that you shouldn't bother with dedup unless your data
can be deduped 2x or better, and that if you're lucky enough to
already have the data on ZFS, you can estimate the reduction with
zdb. Now, I wonder where that number comes from - is it
empirical, or would dedup metadata take roughly as much space as
the data itself, so that below a 2x reduction you gain little or
nothing? ;)
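(For reference, the zdb check I mean is roughly this - 'tank' is
just a placeholder pool name, and the simulation walks the whole
pool, so it can take quite a while:)

# Simulate dedup on an existing pool and print a DDT histogram,
# ending with an estimated ratio (a summary line along the lines
# of "dedup = 1.9, compress = 1.0, copies = 1.0, ...")
zdb -S tank

# On a pool that already has dedup enabled, show actual DDT stats
zdb -DD tank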
Thanks for the research,
//Jim