On 06/23/2014 12:54 PM, Greg Poirier wrote:
On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nel...@inktank.com> wrote:
RBD Cache is definitely going to help in this use case. This test
is basically just sequentially writing a single 16k chunk of data
out, one at a time, i.e. entirely latency bound. At least on OSDs
backed by XFS, you have to wait for that data to hit the journals of
every OSD associated with the object before the acknowledgement gets
sent back to the client.
Again, I can reproduce this with replication disabled.
Replication is the less important part of that statement, it's more
about the specific test you are running.
If you are using the default 4MB block size, you'll hit the same
OSDs over and over again and your other OSDs will sit there
twiddling their thumbs waiting for IO until you hit the next block,
but then it will just be a different set of OSDs getting hit. You
should be able to verify this by using iostat or collectl or
something to look at the behaviour of the SSDs during the test.
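For example (the device names and object name here are placeholders, not
from your setup), something like

    iostat -x 1 /dev/sdb /dev/sdc

on the OSD hosts will show which SSDs are actually taking writes during
the run, and

    ceph osd map rbd <object-name>

will tell you which OSDs a given object maps to, so you can confirm that
only one set of OSDs is busy at any point in the test.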
Since this is all sequential though, switching to buffered IO (i.e.
coalescing IOs at the buffer cache layer) or using RBD cache for direct
IO (coalescing IOs below the block device) will dramatically improve
things.
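As a rough sketch of the kind of test being described (block size, device,
and runtime here are assumptions on my part, not taken from the actual
test):

    fio --name=seq16k --filename=/dev/rbd0 --rw=write --bs=16k \
        --iodepth=1 --direct=1 --runtime=60 --time_based

Dropping --direct=1 lets the Linux buffer cache coalesce those 16k writes
into bigger IOs before they hit RBD; keeping O_DIRECT but enabling RBD
cache gets you similar coalescing below the block device instead.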
This makes sense.
Given the following scenario:
- No replication
- osd_op time average is 0.015 seconds (stddev ~0.003 seconds)
- Network latency is approximately 0.000237 seconds on average
I should be getting 60 IOPS from the OSD reporting this time, right?
So 60 * 16 kB = 960 kB/s. That's slightly slower than what we're actually
seeing (we're getting closer to 100 IOPS), but I'm only able to sample the
slowest ops, so that does make sense, I suppose.
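Back-of-the-envelope, treating each op as fully serial:

    1 / (0.015 s + 0.000237 s) ≈ 66 ops/s
    66 ops/s * 16 kB           ≈ 1 MB/s

which lines up with the 60 IOPS / 960 kB/s estimate above, and the
observed ~100 IOPS makes sense given that the 0.015 s figure is biased
toward the slowest ops.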
So the only way to improve performance would be to not use O_DIRECT (as
this should bypass rbd cache as well, right?).
RBD cache actually will still work. You can think of it like the cache
on a typical hard drive, with similar upsides and downsides. Remember
that O_DIRECT only tries to minimize caching effects by skipping the
Linux buffer cache; it doesn't make any guarantees about what happens
below the block level. Having said that, RBD cache should fully respect
flushes and barriers, but there's (typically) no battery, so you can't
make any other assumptions beyond that.
http://ceph.com/docs/master/rbd/rbd-config-ref/
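For reference, those settings go in the [client] section of ceph.conf;
the values below are just the documented defaults shown for illustration
(see the link above for the authoritative list):

    [client]
        rbd cache = true
        rbd cache writethrough until flush = true
        rbd cache size = 33554432          # 32 MB
        rbd cache max dirty = 25165824     # 24 MB
        rbd cache target dirty = 16777216  # 16 MB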
Ceph is pretty good at small random IO with lots of parallelism on
spinning disk backed OSDs (so long as you use SSD journals or
controllers with WB cache). It's much harder to get native-level
IOPS rates with SSD backed OSDs though. The latency involved in
distributing and processing all of that data becomes a much bigger
deal. Having said that, we are actively working on improving
latency as much as we can. :)
And this is true because flushing from the journal to spinning disks is
going to coalesce the writes into the appropriate blocks in a meaningful
way, right? Or I guess... Why is this?
Well, for random IO you often can't do much coalescing. You have to
bite the bullet and either parallelize things or reduce per-op latency.
Ceph already handles parallelism very well. You just throw more disks
at the problem and so long as there are enough client requests it more
or less just scales (limited by things like network bisection bandwidth
or other complications). On the latency side, spinning disks aren't
fast enough for Ceph's extra latency overhead to matter much, but with
SSDs the story is different. That's why we are very interested in
reducing latency.
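To illustrate the parallelism side (purely hypothetical numbers, and the
RBD device path is a placeholder): the same 16k workload driven with a
deeper queue keeps many PGs/OSDs busy at once, so aggregate IOPS scales
with the number of disks even though per-op latency doesn't change.

    fio --name=par16k --filename=/dev/rbd0 --ioengine=libaio --rw=randwrite \
        --bs=16k --iodepth=32 --numjobs=4 --direct=1 --runtime=60 \
        --time_based --group_reporting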
Regarding journals: Journal writes are always sequential (even for
random IO!), but are O_DIRECT so they'll skip the Linux buffer cache. If
you have hardware that is fast at writing sequential small IO (say a
controller with WB cache or an SSD), you can do journal writes very
quickly. For bursts of small random IO, performance can be quite good.
The downside is that you can hit journal limits very quickly, meaning
you have to flush and wait for the underlying filestore to catch up.
This results in performance that starts out super fast, then stalls once
the journal limits are hit, goes back to super fast for a bit, then hits
another stall, and so on. This is less than ideal given the way CRUSH
distributes data across OSDs. The alternative is setting a soft limit
on how much data is in the journal and flushing smaller amounts of data
more quickly to limit the spiky behaviour. On the whole, that can be
good but limits the burst potential and also limits the amount of data
that could potentially be coalesced in the journal.
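The knobs involved are roughly the filestore sync intervals and the
journal/filestore queue limits; the names and values below are my best
reading of the relevant options and their defaults, so treat this as a
sketch rather than a tuning recommendation:

    [osd]
        # seconds between filestore syncs; syncing lets journal entries be trimmed
        filestore min sync interval = 0.01
        filestore max sync interval = 5
        # how much work can queue up ahead of the filestore
        filestore queue max ops = 50
        filestore queue max bytes = 104857600
        # how much the journal will batch per write
        journal max write bytes = 10485760
        journal max write entries = 100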
Luckily with RBD you can (when applicable) coalesce on the client with
RBD cache instead, which is arguably better anyway since you can send
bigger IOs to the OSDs earlier in the write path. So long as you are ok
with what RBD cache does and does not guarantee, it's definitely worth
enabling imho.
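One caveat if the client is qemu/KVM: the drive's cache mode generally
needs to be writeback for RBD cache to actually be used (qemu's cache
setting overrides ceph.conf), e.g. something like the following, with an
illustrative pool/image name:

    -drive file=rbd:rbd/vm-disk-1,format=raw,cache=writeback

or the equivalent cache='writeback' attribute on the disk in libvirt.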
Why doesn't that happen with SSD journals and SSD OSDs?
SSD journals and SSD OSDs should be fine. I suspect in this case it's
just software latency.
Mark