There were a lot of useful details put into the thread "Summary: Dedup and
L2ARC memory requirements"

Please refer to that thread as necessary...  After all the discussion leading
up to that thread, I thought I had enough understanding to make dedup
useful, but in practice it didn't work out.  Now I've done a lot more
work on it, reduced it all to practice, and I finally feel I can draw
conclusions that are actually useful:

 

I am testing on a Sun Oracle X4270 server: 1 Xeon, 4-core, 2.4GHz, 24G RAM,
and 12 disks, each 2T SAS 7200rpm.  Solaris 11 Express snv_151a.

 

The OS is installed on a single disk.  The remaining 11 disks are all striped
into a single 20 TB pool (no redundancy).  Obviously that's not how you would
configure for production; the point is to get maximum usable size and maximum
performance for testing purposes, so I can actually find the limits in this
lifetime.
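
For reference, the layout was roughly as follows (the pool name, dataset name,
and device names here are illustrative, not my actual ones):

    # 11-way stripe, no redundancy -- test rig only, never do this in production
    zpool create tank c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 \
                      c0t7d0 c0t8d0 c0t9d0 c0t10d0 c0t11d0
    zfs create tank/test
    zfs set dedup=on tank/test      # the baseline runs used dedup=off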

 

With and without dedup, the read and write performance characteristics on
duplicate and unique data are completely different.  That's a lot of
variables.  So here's how I'm going to break it down:  Performance gain of
dedup versus performance loss of dedup.

 

--- Performance gain:

Unfortunately, I found a performance gain in only one area.  When you read
back duplicate data that was previously written with dedup, you get a lot
more cache hits, and as a result the reads go faster.  Unfortunately these
gains are diminished...  I don't know by what...  But you only get about 2x
to 4x performance gain reading previously dedup'd data, as compared to
reading the same data which was never dedup'd.  Even when repeatedly reading
the same file which is 100% duplicate data (created by dd from /dev/zero),
so all the data is 100% in cache, I still see only 2x to 4x performance gain
with dedup.
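
For the curious, the 100% duplicate test file was generated with something
along these lines; the size and path are just an example:

    # all-zero file: with dedup on, every 128K block dedups against the others
    dd if=/dev/zero of=/tank/test/zeros bs=128k count=81920    # ~10G
    # read it back repeatedly and time it
    time dd if=/tank/test/zeros of=/dev/null bs=128k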

 

--- Performance loss:

 

The first conclusion to draw is:  For extremely small pools (say, a few GB
or so), writing with or without dedup performs exactly the same.  As the
unique blocks in the pool grow, write performance with dedup starts to
deviate from write performance without dedup.  It quickly reaches 4x, 6x,
10x slower with dedup... but this write performance degradation appears to
be logarithmic.  I reached 8x write performance degradation around 2.4M
blocks (290G used), but I never exceeded 11x write performance degradation
even when I got up to 14T used (123M blocks).
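
If you want to watch the unique block count and dedup table (DDT) size on
your own pool, zdb will report it; something like this (pool name assumed):

    # DDT histogram: total entries (unique blocks), on-disk and in-core size per entry
    zdb -DD tank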

 

The second conclusion is:  Out of the box, dedup is basically useless.
Thanks to arc_meta_limit being pathetically small by default (3840M on my
24G system), I ran into a write performance brick wall around 30M unique
blocks in the system, ~= 3.6T unique data in the pool.  When I say "write
performance brick wall," I mean that up until that point, dedup writing was
about 8x slower than writing without dedup, and after that point, the write
performance difference grew exponentially.  (Maybe it's not mathematically
exponential, but the numbers look exponential to my naked eye.)  I left it
running for about 19 hours and never got beyond 5.8T written in the system.
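
You can watch yourself approach that wall by comparing arc_meta_used to
arc_meta_limit; for example:

    # metadata portion of the ARC vs. its cap -- the wall is where these meet
    kstat -p zfs:0:arcstats:arc_meta_used zfs:0:arcstats:arc_meta_limit
    # or the same numbers via mdb:
    echo ::arc | mdb -k | grep meta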

 

Fortunately, it's really easy to tweak arc_meta_limit (one way to do it is
shown below), so in the second test, that's what I did.  I set
arc_meta_limit so high it would never be reached.  In this configuration,
the previously described logarithmic write performance degradation continued
much higher.  In other words, dedup write performance was still pretty bad,
but there was no "brick wall" as previously described.  Basically I kept
writing at this rate until the pool was 13.5T full (113M blocks), and the
whole time, dedup write performance was approximately 10x slower than
writing without dedup.  At this point, my arc_meta_used reached 15,500M and
would not grow any more, so I reached a "softer" brick wall.  I could only
conclude that the data being cached in the ARC was pushing the metadata out
of the ARC.  But that's only a guess.
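
One way to raise arc_meta_limit on a live system is with mdb; the value here
is just an example (roughly 20G), and it does not survive a reboot:

    # raise arc_meta_limit so the DDT is never squeezed out by the metadata cap
    echo "arc_meta_limit/Z 0x500000000" | mdb -kw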

 

So the 3rd test was to leave arc_meta_limit at the maximum value and set
primarycache to metadata only (see the one-liner below).  Naturally, if you
use this configuration in production, you're likely to have poor read
performance, because you're guaranteed never to have a data cache hit...
But it could still be a useful configuration, for example if you are using
dedup on a write-only backup server.  In this configuration, write
performance was much better than in the other configurations.  I reached 3x
slower dedup write performance almost immediately.  4x occurred around 47M
blocks (5.6T used), and 5x occurred around 88M blocks (10.6T used).  It
maintained 6x until around 142M blocks (17T used) and 15,732M
arc_meta_used.  At this point I hit 90% full, and the whole system basically
fell apart, so I disregard all the results that came later.  Based on the
numbers I'm seeing, I have every reason to believe the system could have
continued writing with dedup merely 6x slower than writing without dedup if
I had not run out of disk space.  At least theoretically, this should
continue until I cannot fit the metadata in RAM anymore, and then I'll hit a
brick wall...  But the only way to measure that is to go remove RAM from my
system and repeat the test.  I don't think I'll bother.
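
That part of the configuration is a one-liner (dataset name assumed):

    # keep only metadata (DDT, indirect blocks) in the ARC; data blocks are never cached
    zfs set primarycache=metadata tank/test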

 

It has been mentioned before that you suffer a big performance penalty when
you delete things.  This is very, very true.  Destroy a snapshot, or even rm
a file, and the time to completion is on the same order as the time it took
to create all that data in the first place.  This is a really big weak
point.  It might take several hours to destroy the oldest daily snapshot,
and naturally that is likely to happen on a daily basis, every midnight.
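
If you want to see this for yourself, just time the delete (the snapshot and
file names here are hypothetical):

    # freeing dedup'd blocks means a DDT lookup/update per block -- expect it to crawl
    time zfs destroy tank/test@oldest-daily
    time rm /tank/test/bigfile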
