On 06/23/2014 12:54 PM, Greg Poirier wrote:
On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nel...@inktank.com> wrote:

    RBD Cache is definitely going to help in this use case.  This test
    is basically just writing a single 16k chunk of data out at a time,
    sequentially, i.e. it is entirely latency bound.  At least on OSDs
    backed by XFS, you have to wait for that data to hit the journals of
    every OSD associated with the object before the acknowledgement gets
    sent back to the client.


Again, I can reproduce this with replication disabled.

Replication is the less important part of that statement; it's more about the specific test you are running.
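
For reference, here is a minimal sketch (my illustration, not from the thread; the device path is hypothetical) of the kind of workload being described: sequential 16k writes with O_DIRECT at queue depth 1, so each write has to wait for the previous acknowledgement.

import mmap
import os

BLOCK = 16 * 1024          # 16k writes, as in the test being discussed
COUNT = 1024               # 16 MB written in total

# mmap returns a page-aligned buffer, which O_DIRECT requires
buf = mmap.mmap(-1, BLOCK)
buf.write(b'\xab' * BLOCK)

# hypothetical RBD block device; needs root, any O_DIRECT file behaves similarly
fd = os.open('/dev/rbd0', os.O_WRONLY | os.O_DIRECT)
try:
    for _ in range(COUNT):
        # one synchronous write at a time: entirely latency bound,
        # nothing for the other OSDs to do in the meantime
        os.write(fd, buf)
finally:
    os.close(fd)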


    If you are using the default 4MB block size, you'll hit the same
    OSDs over and over again, and your other OSDs will sit there
    twiddling their thumbs waiting for IO until you hit the next block,
    but then it will just be a different set of OSDs getting hit.  You
    should be able to verify this by using iostat or collectl or
    something to look at the behaviour of the SSDs during the test.
    Since this is all sequential though, switching to buffered IO (i.e.
    coalescing IOs at the buffer cache layer) or using RBD cache for
    direct IO (coalescing IOs below the block device) will dramatically
    improve things.


This makes sense.

Given the following scenario:

- No replication
- osd_op time average is 0.015 seconds (stddev ~0.003 seconds)
- Network latency is approximately 0.000237 seconds on average

I should be getting 60 IOPS from the OSD reporting this time, right?

So 60 * 16 kB = 960 kB/s.  That's slightly slower than what we're actually
getting, because I'm only able to sample the slowest ops; we're seeing closer
to 100 IOPS.  But that does make sense, I suppose.
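
Spelling out the arithmetic (my sketch, just plugging in the numbers above):

osd_op_time = 0.015            # average seconds per op (from the slowest samples)
net_latency = 0.000237         # average network round trip in seconds

# with queue depth 1, each op must finish before the next one starts
iops = 1.0 / (osd_op_time + net_latency)
print(round(iops))             # ~66 IOPS; rounding down gives the ~60 used above

block_kb = 16
print(round(iops * block_kb))  # ~1050 kB/s; 60 IOPS * 16 kB is the 960 kB/s above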

So the only way to improve performance would be to not use O_DIRECT (as
O_DIRECT should bypass the RBD cache as well, right?).

RBD cache actually will still work. You can think of it like the cache on a typical hard drive, with similar upsides and downsides. Remember that O_DIRECT only tries to minimize caching effects by skipping the Linux buffer cache; it doesn't make any guarantees about what happens below the block level. Having said that, RBD cache should fully respect flushes and barriers, but there's (typically) no battery, so you can't make any other assumptions beyond that.

http://ceph.com/docs/master/rbd/rbd-config-ref/
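
For completeness, a minimal sketch of turning the cache on through the Python librbd bindings, using the options from the config reference above. The pool and image names are hypothetical and the image is assumed to already exist; in practice you would normally just set these options in the [client] section of ceph.conf.

import rados
import rbd

cluster = rados.Rados(
    conffile='/etc/ceph/ceph.conf',
    conf={
        'rbd_cache': 'true',                            # enable client-side RBD cache
        'rbd_cache_writethrough_until_flush': 'true',   # write-through until the first flush
    })
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')         # hypothetical pool
    image = rbd.Image(ioctx, 'test-image')    # hypothetical, pre-existing image
    try:
        image.write(b'\0' * 16384, 0)         # 16k write, may be absorbed by the cache
        image.flush()                         # flushes are honoured; data reaches the OSDs
    finally:
        image.close()
    ioctx.close()
finally:
    cluster.shutdown()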


    Ceph is pretty good at small random IO with lots of parallelism on
    spinning-disk-backed OSDs (so long as you use SSD journals or
    controllers with WB cache).  It's much harder to get native-level
    IOPS rates with SSD-backed OSDs though.  The latency involved in
    distributing and processing all of that data becomes a much bigger
    deal.  Having said that, we are actively working on improving
    latency as much as we can. :)


And this is true because flushing from the journal to spinning disks is
going to coalesce the writes into the appropriate blocks in a meaningful
way, right? Or I guess... Why is this?

Well, for random IO you often can't do much coalescing; you have to bite the bullet and either parallelize things or reduce per-op latency. Ceph already handles parallelism very well: you just throw more disks at the problem, and so long as there are enough client requests it more or less just scales (limited by things like network bisection bandwidth and other complications). On the latency side, spinning disks aren't fast enough for Ceph's extra latency overhead to matter much, but with SSDs the story is different. That's why we are very interested in reducing latency.

Regarding journals: journal writes are always sequential (even for random IO!), but they are O_DIRECT, so they skip the Linux buffer cache. If you have hardware that is fast at small sequential writes (say a controller with WB cache, or an SSD), you can do journal writes very quickly, so for bursts of small random IO performance can be quite good. The downside is that you can hit the journal limits very quickly, meaning you have to flush and wait for the underlying filestore to catch up. This results in performance that starts out super fast, then stalls once the journal limits are hit, goes back to super fast again for a bit, then hits another stall, etc. This is less than ideal given the way CRUSH distributes data across OSDs. The alternative is setting a soft limit on how much data is in the journal and flushing smaller amounts of data more frequently to limit the spiky behaviour. On the whole that can be good, but it limits the burst potential and also limits the amount of data that could potentially be coalesced in the journal.
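
To make the burst/stall cycle concrete, here's a toy model with made-up numbers (not Ceph code): a journal that absorbs client writes quickly until it hits its limit, then has to wait for the slower filestore to drain it.

JOURNAL_LIMIT = 1000      # MB the journal may hold before a forced flush (made up)
CLIENT_RATE = 300         # MB/s of small writes arriving while the journal has room
FILESTORE_RATE = 100      # MB/s the backing filestore can actually sustain

used = 0
for second in range(20):
    if used < JOURNAL_LIMIT:
        # journal has room: writes are absorbed at journal speed
        used += CLIENT_RATE - FILESTORE_RATE
        print(second, 'fast  - journal holds', used, 'MB')
    else:
        # journal limit hit: clients stall while the filestore catches up
        used -= FILESTORE_RATE
        print(second, 'STALL - journal holds', used, 'MB')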

Luckily, with RBD you can (when applicable) coalesce on the client with RBD cache instead, which is arguably better anyway since you can send bigger IOs to the OSDs earlier in the write path. So long as you are OK with what RBD cache does and does not guarantee, it's definitely worth enabling IMHO.


Why doesn't that happen with SSD journals and SSD OSDs?

SSD journals and SSD OSDs should be fine. I suspect in this case it's just software latency.

Mark
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
