On 06/12/2014 03:44 AM, David wrote:
Hi,
We have 5 OSD servers, with 10 OSDs each (journals on enterprise SSDs).
We lost an OSD and the cluster started to backfill the data to the rest of the
OSDs - during which the latency skyrocketed on some OSDs and connected clients
experienced massive IO wait.
I’m trying to rectify the situation now and from what I can tell, these are the
settings that might help.
osd client op priority
osd recovery op priority
osd max backfills
osd recovery max active
1. Does a higher priority value mean higher priority (relative to a setting
with a lower value), or does a priority of 1 mean highest priority?
2. I’m running with default on these settings. Does anyone else have any
experience changing those?
We did some investigation into this a little while back. I suspect
you'll see some benefit from reducing backfill/recovery priority and the
maximum number of concurrent operations, but you have to be careful. We
found that the higher the number of concurrent client IOs (past the
saturation point), the greater the relative proportion of throughput
consumed by client IO. That makes it hard to nail down specific priority
and concurrency settings. If your workload requires high throughput and
low latency with few client IOs (i.e. below the saturation point), you
may need to heavily favor client IO. If you are over-saturating the
cluster with many concurrent IOs, you may want to give client IO less
priority. If you overly favor client IO while over-saturating the
cluster, recovery can take much, much longer and aggregate client
throughput may actually be lower. Obviously this isn't ideal, but that
seems to be what's going on right now.
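For reference, these are the knobs I'd start with. The option names are the
standard OSD settings; the values below are only illustrative starting points
for throttling recovery, not recommendations, and the defaults vary by
release, so tune for your own cluster:

```
# Runtime change, takes effect immediately on all OSDs:
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'

# Persistent equivalent in the [osd] section of ceph.conf:
[osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1
```

Regarding your first question: for the op priority settings, a higher value
means higher priority (osd client op priority defaults to 63, well above the
recovery op priority default), so lowering recovery op priority further
favors client IO.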
Mark
Kind Regards,
David Majchrzak
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com