On 06/23/2014 12:54 PM, Greg Poirier wrote:
On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nel...@inktank.com> wrote:

    RBD Cache is definitely going to help in this use case.  This test
    is basically just writing a single 16k chunk of data out at a time,
    sequentially, i.e. it is entirely latency bound.  At least on OSDs
    backed by XFS, you have to wait for that data to hit the journals of
    every OSD associated with the object before the acknowledgement gets
    sent back to the client.


Again, I can reproduce this with replication disabled.

Replication is the less important part of that statement; it's more about the specific test you are running.
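
For reference, here is a minimal sketch (my illustration, not from the thread; the device path is hypothetical) of the kind of workload being described: sequential 16k writes with O_DIRECT at queue depth 1, so each write has to wait for the previous acknowledgement.

import mmap
import os

BLOCK = 16 * 1024          # 16k writes, as in the test being discussed
COUNT = 1024               # 16 MB written in total

# mmap returns a page-aligned buffer, which O_DIRECT requires
buf = mmap.mmap(-1, BLOCK)
buf.write(b'\xab' * BLOCK)

# hypothetical RBD block device; needs root, any O_DIRECT file behaves similarly
fd = os.open('/dev/rbd0', os.O_WRONLY | os.O_DIRECT)
try:
    for _ in range(COUNT):
        # one synchronous write at a time: entirely latency bound,
        # nothing for the other OSDs to do in the meantime
        os.write(fd, buf)
finally:
    os.close(fd)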


    If you are using the default 4MB block size, you'll hit the same
    OSDs over and over again, and your other OSDs will sit there
    twiddling their thumbs waiting for IO until you hit the next block,
    but then it will just be a different set of OSDs getting hit.  You
    should be able to verify this by using iostat or collectl or
    something to look at the behaviour of the SSDs during the test.
    Since this is all sequential though, switching to buffered IO (i.e.
    coalescing IOs at the buffer cache layer) or using RBD cache for
    direct IO (coalescing IOs below the block device) will dramatically
    improve things.


This makes sense.

Given the following scenario:

- No replication
- osd_op time average is 0.015 seconds (stddev ~0.003 seconds)
- Network latency is approximately 0.000237 seconds on average

I should be getting 60 IOPS from the OSD reporting this time, right?

So 60 * 16 kB = 960 kB/s.  That's slightly slower than what we're actually
getting, because I'm only able to sample the slowest ops; we're seeing closer
to 100 IOPS.  But that does make sense, I suppose.
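
Spelling out the arithmetic (my sketch, just plugging in the numbers above):

osd_op_time = 0.015            # average seconds per op (from the slowest samples)
net_latency = 0.000237         # average network round trip in seconds

# with queue depth 1, each op must finish before the next one starts
iops = 1.0 / (osd_op_time + net_latency)
print(round(iops))             # ~66 IOPS; rounding down gives the ~60 used above

block_kb = 16
print(round(iops * block_kb))  # ~1050 kB/s; 60 IOPS * 16 kB is the 960 kB/s above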

So the only way to improve performance would be to not use O_DIRECT (as
O_DIRECT should bypass the RBD cache as well, right?).

RBD cache actually will still work. You can think of it like the cache on a typical hard drive, with similar upsides and downsides. Remember that O_DIRECT only tries to minimize caching effects by skipping the Linux buffer cache; it doesn't make any guarantees about what happens below the block level. Having said that, RBD cache should fully respect flushes and barriers, but there's (typically) no battery, so you can't make any other assumptions beyond that.

http://ceph.com/docs/master/rbd/rbd-config-ref/
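
For completeness, a minimal sketch of turning the cache on through the Python librbd bindings, using the options from the config reference above. The pool and image names are hypothetical and the image is assumed to already exist; in practice you would normally just set these options in the [client] section of ceph.conf.

import rados
import rbd

cluster = rados.Rados(
    conffile='/etc/ceph/ceph.conf',
    conf={
        'rbd_cache': 'true',                            # enable client-side RBD cache
        'rbd_cache_writethrough_until_flush': 'true',   # write-through until the first flush
    })
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')         # hypothetical pool
    image = rbd.Image(ioctx, 'test-image')    # hypothetical, pre-existing image
    try:
        image.write(b'\0' * 16384, 0)         # 16k write, may be absorbed by the cache
        image.flush()                         # flushes are honoured; data reaches the OSDs
    finally:
        image.close()
    ioctx.close()
finally:
    cluster.shutdown()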


    Ceph is pretty good at small random IO with lots of parallelism on
    spinning-disk-backed OSDs (so long as you use SSD journals or
    controllers with WB cache).  It's much harder to get native-level
    IOPS rates with SSD-backed OSDs though.  The latency involved in
    distributing and processing all of that data becomes a much bigger
    deal.  Having said that, we are actively working on improving
    latency as much as we can. :)


And this is true because flushing from the journal to spinning disks is
going to coalesce the writes into the appropriate blocks in a meaningful
way, right? Or I guess... Why is this?

Well, for random IO you often can't do much coalescing; you have to bite the bullet and either parallelize things or reduce per-op latency. Ceph already handles parallelism very well: you just throw more disks at the problem, and so long as there are enough client requests it more or less just scales (limited by things like network bisection bandwidth and other complications). On the latency side, spinning disks aren't fast enough for Ceph's extra latency overhead to matter much, but with SSDs the story is different. That's why we are very interested in reducing latency.

Regarding journals: journal writes are always sequential (even for random IO!), but they are O_DIRECT, so they skip the Linux buffer cache. If you have hardware that is fast at small sequential writes (say a controller with WB cache, or an SSD), you can do journal writes very quickly, so for bursts of small random IO performance can be quite good. The downside is that you can hit the journal limits very quickly, meaning you have to flush and wait for the underlying filestore to catch up. This results in performance that starts out super fast, then stalls once the journal limits are hit, goes back to super fast again for a bit, then hits another stall, etc. This is less than ideal given the way CRUSH distributes data across OSDs. The alternative is setting a soft limit on how much data is in the journal and flushing smaller amounts of data more frequently to limit the spiky behaviour. On the whole that can be good, but it limits the burst potential and also limits the amount of data that could potentially be coalesced in the journal.
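
To make the burst/stall cycle concrete, here's a toy model with made-up numbers (not Ceph code): a journal that absorbs client writes quickly until it hits its limit, then has to wait for the slower filestore to drain it.

JOURNAL_LIMIT = 1000      # MB the journal may hold before a forced flush (made up)
CLIENT_RATE = 300         # MB/s of small writes arriving while the journal has room
FILESTORE_RATE = 100      # MB/s the backing filestore can actually sustain

used = 0
for second in range(20):
    if used < JOURNAL_LIMIT:
        # journal has room: writes are absorbed at journal speed
        used += CLIENT_RATE - FILESTORE_RATE
        print(second, 'fast  - journal holds', used, 'MB')
    else:
        # journal limit hit: clients stall while the filestore catches up
        used -= FILESTORE_RATE
        print(second, 'STALL - journal holds', used, 'MB')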

Luckily, with RBD you can (when applicable) coalesce on the client with RBD cache instead, which is arguably better anyway since you can send bigger IOs to the OSDs earlier in the write path. So long as you are OK with what RBD cache does and does not guarantee, it's definitely worth enabling IMHO.


Why doesn't that happen with SSD journals and SSD OSDs?

SSD journals and SSD OSDs should be fine. I suspect in this case it's just software latency.

Mark
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
