On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nel...@inktank.com>
wrote:

> RBD Cache is definitely going to help in this use case.  This test is
> basically just sequentially writing a single 16k chunk of data out, one at
> a time; i.e., it is entirely latency bound.  At least on OSDs backed by XFS, you
> have to wait for that data to hit the journals of every OSD associated with
> the object before the acknowledgement gets sent back to the client.
>

Again, I can reproduce this with replication disabled.


>  If you are using the default 4MB block size, you'll hit the same OSDs
> over and over again and your other OSDs will sit there twiddling their
> thumbs waiting for IO until you hit the next block, but then it will just
> be a different set of OSDs getting hit.  You should be able to verify this by
> using iostat or collectl or something to look at the behaviour of the SSDs
> during the test.  Since this is all sequential though, switching to
>  buffered IO (i.e. coalesce IOs at the buffer cache layer) or using RBD cache
> for direct IO (coalesce IOs below the block device) will dramatically
> improve things.
>

This makes sense.
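
For what it's worth, here's a rough back-of-the-envelope sketch in Python of
why that is (assuming the default 4 MB object size; CRUSH placement itself
isn't modelled here, just the offset-to-object mapping):

    # Sequential 16k writes mapped onto RBD objects, assuming the default
    # 4 MB object size. 256 consecutive writes land in the same object
    # (and so hit the same primary OSD) before moving on to the next one.
    OBJECT_SIZE = 4 * 1024 * 1024
    BLOCK_SIZE = 16 * 1024

    for i in range(768):
        offset = i * BLOCK_SIZE
        obj = offset // OBJECT_SIZE
        if offset % OBJECT_SIZE == 0:
            print("write %4d at offset %9d -> starts object %d" % (i, offset, obj))
    # prints: writes 0, 256 and 512 start objects 0, 1 and 2; everything
    # in between keeps hammering the same object.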

Given the following scenario:

- No replication
- osd_op time average is 0.015 seconds (stddev ~0.003 seconds)
- Network latency is approximately 0.000237 seconds on average

I should be getting roughly 60-65 IOPS out of the OSD reporting this time
(about one op per 0.015 s), right?

So 60 IOPS * 16 kB = 960 kB/s.  That's slightly slower than what we're
actually seeing, because I'm only able to sample the slowest ops; in
practice we're getting closer to 100 IOPS.  But that does make sense, I
suppose.
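
Spelled out, the arithmetic I'm doing looks roughly like this (a quick
Python sketch; the latency numbers are just the sampled averages quoted
above):

    # Latency-bound throughput estimate for 16k direct writes at queue
    # depth 1 with no replication, using the sampled averages from above.
    osd_op_latency = 0.015      # seconds per op (sampled from the slowest ops)
    net_latency = 0.000237      # seconds of network round trip
    block_size = 16 * 1024      # bytes per write

    per_op = osd_op_latency + net_latency
    iops = 1.0 / per_op
    throughput_kb = iops * block_size / 1024.0

    print("%.0f IOPS, %.0f kB/s" % (iops, throughput_kb))
    # -> ~66 IOPS and ~1 MB/s from the sampled latencies; the observed
    #    ~100 IOPS just means the true average op time is a bit lower.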

So the only way to improve performance would be to stop using O_DIRECT
(since O_DIRECT should bypass the RBD cache as well, right?).
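
To make the distinction concrete, here's a rough sketch of the two write
paths (Python on Linux; /mnt/rbd/testfile is just a placeholder path on an
RBD-backed filesystem, and the O_DIRECT alignment handling is simplified to
a page-aligned anonymous mmap):

    import mmap
    import os

    BLOCK = 16 * 1024   # 16k writes, as in the test
    COUNT = 256         # 4 MB total, i.e. one object's worth at defaults
    PATH = "/mnt/rbd/testfile"   # placeholder path

    # Buffered path: the page cache absorbs and coalesces the 16k writes,
    # so each write returns after a memcpy rather than an OSD round trip.
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    buf = b"x" * BLOCK
    for _ in range(COUNT):
        os.write(fd, buf)
    os.close(fd)

    # O_DIRECT path: every 16k write goes straight down the stack and waits
    # for the OSD/journal acknowledgement, so throughput is BLOCK / latency.
    # O_DIRECT requires aligned buffers; an anonymous mmap is page-aligned.
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    abuf = mmap.mmap(-1, BLOCK)
    abuf.write(b"x" * BLOCK)
    for _ in range(COUNT):
        os.write(fd, abuf)
    os.close(fd)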


> Ceph is pretty good at small random IO with lots of parallelism on
> spinning disk backed OSDs (So long as you use SSD journals or controllers
> with WB cache).  It's much harder to get native-level IOPS rates with SSD
> backed OSDs though.  The latency involved in distributing and processing
> all of that data becomes a much bigger deal.  Having said that, we are
> actively working on improving latency as much as we can. :)


And this is true because flushing from the journal to spinning disks is
going to coalesce the writes into the appropriate blocks in a meaningful
way, right? Or I guess... Why is this?

Why doesn't that happen with SSD journals and SSD OSDs?