On 06/12/2014 03:44 AM, David wrote:
Hi,

We have 5 OSD servers, with 10 OSDs each (journals on enterprise SSDs).

We lost an OSD and the cluster started to backfill the data to the rest of the 
OSDs - during which the latency skyrocketed on some OSDs and connected clients 
experienced massive IO wait.

I’m trying to rectify the situation now and from what I can tell, these are the 
settings that might help.

osd client op priority
osd recovery op priority
osd max backfills
osd recovery max active

1. Does a higher value mean higher priority (relative to the other setting's 
value), or does a priority of 1 mean highest priority?
2. I’m running with default on these settings. Does anyone else have any 
experience changing those?

We did some investigation into this a little while back. I suspect you'll see some benefit from reducing backfill/recovery priority and the maximum number of concurrent operations, but you have to be careful.

We found that the higher the number of concurrent client IOs (past the saturation point), the greater the relative proportion of throughput consumed by client IO. That makes it hard to nail down specific priority and concurrency settings. If your workload requires high throughput and low latency with few client IOs (i.e. below the saturation point), you may need to strongly favour client IO. If you are over-saturating the cluster with many concurrent IOs, you may want to give client IO less priority: if you strongly favour client IO while the cluster is over-saturated, recovery can take much, much longer and aggregate client throughput may actually be lower. Obviously this isn't ideal, but it seems to be what's going on right now.
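As a rough illustration of "reduce backfill/recovery priority and concurrency", a conservative starting point might look like the fragment below. The values shown are a sketch, not tuned recommendations; the defaults vary by Ceph release, so check your own running config before changing anything:

```ini
; ceph.conf fragment -- illustrative values only, verify defaults
; for your Ceph release before applying
[osd]
    ; limit concurrent backfill operations per OSD
    osd max backfills = 1
    ; limit concurrent active recovery operations per OSD
    osd recovery max active = 1
    ; for the op priority settings, a higher value means higher
    ; priority, so recovery is de-prioritised relative to client IO
    osd recovery op priority = 1
    osd client op priority = 63
```

These can also be applied to a running cluster without a restart, e.g. `ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'`, which is useful for experimenting before committing values to ceph.conf.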

Mark


Kind Regards,
David Majchrzak
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

