Thanks Mark!

Well, our workload has more IOs and quite low throughput, perhaps 10MB/s -> 
100MB/s. It's quite a mixed workload, but mostly small files (http / mail / 
sql).
During the recovery, throughput ranged between 600-1000MB/s.

So is the only way to currently "fix" this to have enough IO capacity to handle 
both recovery and client IOs?
What's the easiest/best way to add more IOPS to an existing cluster if you don't 
want to scale out? Add more RAM to the OSD servers, or add an SSD-backed 
read/write cache tier?
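In the meantime, something like this in ceph.conf is roughly what I had in mind 
for throttling recovery. The values are only a starting point on my part, not 
tested recommendations, and the defaults noted in the comments are what I 
believe ships in current releases:

```ini
[osd]
; Throttle backfill/recovery so client IO keeps more of the disk time.
; Values below are a sketch, not tested recommendations.
osd max backfills = 1          ; concurrent backfills per OSD (default 10, I believe)
osd recovery max active = 1    ; concurrent recovery ops per OSD (default 15)
osd recovery op priority = 1   ; lower value = less weight relative to client ops
osd client op priority = 63    ; the default; higher value = more weight
```

If I understand correctly, these can also be changed at runtime without a 
restart, e.g. `ceph tell osd.* injectargs '--osd-max-backfills 1 
--osd-recovery-max-active 1'`, which would let one experiment during the next 
recovery event.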

Kind Regards,

David Majchrzak


On 12 Jun 2014, at 14:42, Mark Nelson <mark.nel...@inktank.com> wrote:

> On 06/12/2014 03:44 AM, David wrote:
>> Hi,
>> 
>> We have 5 OSD servers, with 10 OSDs each (journals on enterprise SSDs).
>> 
>> We lost an OSD and the cluster started to backfill the data to the rest of 
>> the OSDs - during which the latency skyrocketed on some OSDs and connected 
>> clients experienced massive IO wait.
>> 
>> I’m trying to rectify the situation now and from what I can tell, these are 
>> the settings that might help.
>> 
>> osd client op priority
>> osd recovery op priority
>> osd max backfills
>> osd recovery max active
>> 
>> 1. Does a high priority value mean it has higher priority? (if the other one 
>> has lower value) Or does a priority of 1 mean highest priority?
>> 2. I’m running with default on these settings. Does anyone else have any 
>> experience changing those?
> 
> We did some investigation into this a little while back.  I suspect you'll 
> see some benefit from reducing backfill/recovery priority and the max 
> concurrent operations, but you have to be careful.  We found that the higher 
> the number of concurrent client IOs (past the saturation point), the greater 
> the relative proportion of throughput used by client IO. That makes it hard 
> to nail down specific priority and concurrency settings.  If your workload 
> requires high throughput and low latency with few client IOs (i.e. below the 
> saturation point), you may need to heavily favour client IO.  If you are 
> over-saturating the cluster with many concurrent IOs, you may want to give 
> client IO less priority.  If you heavily favour client IO while 
> over-saturating the cluster, recovery can take much, much longer and client 
> throughput may actually be lower in aggregate.  Obviously this isn't ideal, 
> but it seems to be what's going on right now.
> 
> Mark
> 
>> 
>> Kind Regards,
>> David Majchrzak
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
