On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nel...@inktank.com> wrote:
> RBD Cache is definitely going to help in this use case. This test is
> basically just sequentially writing a single 16k chunk of data out, one at
> a time. IE, entirely latency bound. At least on OSDs backed by XFS, you
> have to wait for that data to hit the journals of every OSD associated with
> the object before the acknowledgement gets sent back to the client.

Again, I can reproduce this with replication disabled.

> If you are using the default 4MB block size, you'll hit the same OSDs
> over and over again and your other OSDs will sit there twiddling their
> thumbs waiting for IO until you hit the next block, but then it will just
> be a different set of OSDs getting hit. You should be able to verify this by
> using iostat or collectl or something to look at the behaviour of the SSDs
> during the test. Since this is all sequential though, switching to
> buffered IO (ie coalesce IOs at the buffercache layer) or using RBD cache
> for direct IO (coalesce IOs below the block device) will dramatically
> improve things.

This makes sense. Given the following scenario:

- No replication
- osd_op time average is .015 seconds (stddev ~.003 seconds)
- Network latency is approximately .000237 seconds on average

I should be getting roughly 60 IOPS from the OSD reporting this time, right?
So 60 * 16kB = 960kB/s (see the quick sanity check at the bottom of this
mail). That's slightly slower than we're actually getting, because I'm only
able to sample the slowest ops; we're seeing closer to 100 IOPS. But that
does make sense, I suppose.

So the only way to improve performance would be to not use O_DIRECT (as
O_DIRECT should bypass RBD cache as well, right?).

> Ceph is pretty good at small random IO with lots of parallelism on
> spinning disk backed OSDs (So long as you use SSD journals or controllers
> with WB cache). It's much harder to get native-level IOPS rates with SSD
> backed OSDs though. The latency involved in distributing and processing
> all of that data becomes a much bigger deal. Having said that, we are
> actively working on improving latency as much as we can. :)

And this is true because flushing from the journal to spinning disks will
coalesce the writes into the appropriate blocks in a meaningful way, right?
Or, I guess... why is this? Why doesn't the same thing happen with SSD
journals and SSD OSDs?
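
For reference, here is a rough back-of-the-envelope sketch of the
serial-latency math above, just as an illustration in Python using the
numbers from this thread (it assumes queue depth 1 with no caching or
coalescing, i.e. every 16k write waits for the previous one):

    # Back-of-the-envelope estimate for serial (queue depth 1) 16k direct writes:
    # each write has to wait for the osd_op plus the network round trip before
    # the next one is issued, so there is no overlap or coalescing at all.
    # The latency numbers are the ones quoted above, not a general model.

    osd_op_latency = 0.015       # seconds, average osd_op time (slowest sampled ops)
    network_latency = 0.000237   # seconds, average network latency
    io_size_kb = 16              # 16k sequential writes

    per_io_latency = osd_op_latency + network_latency
    iops = 1.0 / per_io_latency
    throughput_kb_s = iops * io_size_kb

    print("per-IO latency: %.5f s" % per_io_latency)      # ~0.01524 s
    print("IOPS:           %.1f" % iops)                  # ~65.6
    print("throughput:     %.0f kB/s" % throughput_kb_s)  # ~1050 kB/s

With those latencies it works out to roughly 65 IOPS and about 1 MB/s,
which is in the same ballpark as what we're observing.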