Hi Greg,

>>So the only way to improve performance would be to not use O_DIRECT (as this 
>>should bypass rbd cache as well, right?).

Yes, indeed, O_DIRECT bypasses the cache.



BTW, do you need to run MySQL with O_DIRECT? The default innodb_flush_method is 
fdatasync, so it should work with the cache.
(But you can lose some writes in case of a crash.)
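
For reference, a rough my.cnf sketch of what I mean (only an illustration; 
I'm assuming you currently set O_DIRECT in my.cnf): 

[mysqld]
# What you likely have now: O_DIRECT opens the data files bypassing the
# guest page cache.
#innodb_flush_method = O_DIRECT

# Dropping the line (or setting it explicitly on MySQL 5.6+) falls back to
# the default "fdatasync", which despite the name uses fsync() and lets
# writes go through the page cache, where rbd cache can coalesce them.
innodb_flush_method = fdatasync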




----- Original Message ----- 

From: "Mark Nelson" <mark.nel...@inktank.com> 
To: "Greg Poirier" <greg.poir...@opower.com> 
Cc: "Zach Dunn" <zach.d...@opower.com>, ceph-users@lists.ceph.com, "Unai Basterretxea" <unai.basterret...@opower.com> 
Sent: Monday, 23 June 2014 21:03:17 
Subject: Re: [ceph-users] Poor performance on all SSD cluster 

On 06/23/2014 12:54 PM, Greg Poirier wrote: 
> On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson <mark.nel...@inktank.com> wrote: 
> 
> RBD Cache is definitely going to help in this use case. This test 
> is basically just sequentially writing a single 16k chunk of data 
> out, one at a time. IE, entirely latency bound. At least on OSDs 
> backed by XFS, you have to wait for that data to hit the journals of 
> every OSD associated with the object before the acknowledgement gets 
> sent back to the client. 
> 
> 
> Again, I can reproduce this with replication disabled. 

Replication is the less important part of that statement; it's more 
about the specific test you are running. 

> 
> If you are using the default 4MB block size, you'll hit the same 
> OSDs over and over again and your other OSDs will sit there 
> twiddling their thumbs waiting for IO until you hit the next block, 
> but then it will just be a different set of OSDs getting hit. You 
> should be able to verify this by using iostat or collectl or 
> something to look at the behaviour of the SSDs during the test. 
> Since this is all sequential though, switching to buffered IO (ie 
> coalesce IOs at the buffercache layer) or using RBD cache for direct 
> IO (coalesce IOs below the block device) will dramatically improve 
> things. 
> 
> 
> This makes sense. 
> 
> Given the following scenario: 
> 
> - No replication 
> - osd_op time average is .015 seconds (stddev ~.003 seconds) 
> - Network latency is approximately .000237 seconds on avg 
> 
> I should be getting 60 IOPS from the OSD reporting this time, right? 
> 
> So 60 * 16kB = 960kB. That's slightly slower than we're getting because 
> I'm only able to sample the slowest ops. We're getting closer to 100 
> IOPS. But that does make sense, I suppose. 
> 
> So the only way to improve performance would be to not use O_DIRECT (as 
> this should bypass rbd cache as well, right?). 

RBD cache actually will still work. You can think of it like the cache on a 
typical hard drive with similar upsides and downsides. Remember that 
O_DIRECT only tries to minimize caching effects by skipping the linux 
buffer cache. It doesn't make any guarantees about what happens below 
the block level. Having said that, RBD cache should fully respect 
flushes and barriers, but there's (typically) no battery so you can't 
make any other assumptions beyond that. 

http://ceph.com/docs/master/rbd/rbd-config-ref/ 
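
For example, something along these lines in the [client] section of 
ceph.conf should turn it on (rough sketch; the sizes are just the 
documented defaults, tune to taste): 

[client]
rbd cache = true
# stay in writethrough mode until the guest sends its first flush, so
# guests that never flush aren't silently exposed to writeback caching
rbd cache writethrough until flush = true
rbd cache size = 33554432          # 32 MB
rbd cache max dirty = 25165824     # 24 MB
rbd cache target dirty = 16777216  # 16 MB

(If you're running qemu, the drive's cache= setting can override this, 
IIRC.) 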

> 
> Ceph is pretty good at small random IO with lots of parallelism on 
> spinning disk backed OSDs (So long as you use SSD journals or 
> controllers with WB cache). It's much harder to get native-level 
> IOPS rates with SSD backed OSDs though. The latency involved in 
> distributing and processing all of that data becomes a much bigger 
> deal. Having said that, we are actively working on improving 
> latency as much as we can. :) 
> 
> 
> And this is true because flushing from the journal to spinning disks is 
> going to coalesce the writes into the appropriate blocks in a meaningful 
> way, right? Or I guess... Why is this? 

Well, for random IO you often can't do much coalescing. You have to 
bite the bullet and either parallelize things or reduce per-op latency. 
Ceph already handles parallelism very well. You just throw more disks 
at the problem and so long as there are enough client requests it more 
or less just scales (limited by things like network bisection bandwidth 
or other complications). On the latency side, spinning disks aren't 
fast enough for Ceph's extra latency overhead to matter much, but with 
SSDs the story is different. That's why we are very interested in 
reducing latency. 

Regarding journals: Journal writes are always sequential (even for 
random IO!), but are O_DIRECT so they'll skip linux buffer cache. If 
you have hardware that is fast at writing sequential small IO (say a 
controller with WB cache or an SSD), you can do journal writes very 
quickly. For bursts of small random IO, performance can be quite good. 
The downside is that you can hit journal limits very quickly, meaning 
you have to flush and wait for the underlying filestore to catch up. 
This results in performance that starts out super fast, then stalls once 
the journal limits are hit, back to super fast again for a bit, then 
another stall, etc. This is less than ideal given the way CRUSH 
distributes data across OSDs. The alternative is setting a soft limit 
on how much data is in the journal and flushing smaller amounts of data 
more quickly to limit the spiky behaviour. On the whole, that can be 
good but limits the burst potential and also limits the amount of data 
that could potentially be coalesced in the journal. 
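
If you want to play with that trade-off, these are (I believe) the 
relevant knobs in ceph.conf; the values below are only an illustration, 
not a recommendation: 

[osd]
# min/max time the filestore waits between syncing journaled data out to
# the backing filesystem (shorter = smaller, more frequent flushes)
filestore min sync interval = 0.01
filestore max sync interval = 5
# caps on how much outstanding data the journal and filestore queues will
# accept before pushing back on new ops
journal queue max bytes = 33554432
filestore queue max bytes = 104857600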

Luckily with RBD you can (when applicable) coalesce on the client with 
RBD cache instead, which is arguably better anyway since you can send 
bigger IOs to the OSDs earlier in the write path. So long as you are ok 
with what RBD cache does and does not guarantee, it's definitely worth 
enabling imho. 

> 
> Why doesn't that happen with SSD journals and SSD OSDs? 

SSD journals and SSD OSDs should be fine. I suspect in this case it's 
just software latency. 

Mark 
_______________________________________________ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 