Thanks Mark! Well, our workload is IO-heavy with fairly low throughput, perhaps 10MB/s to 100MB/s. It's quite a mixed workload, but mostly small files (http / mail / sql). During the recovery, throughput ranged between 600-1000MB/s.
So the only way to currently "fix" this is to have enough IO capacity to handle both recovery and client IO? What's the easiest/best way to add more IOPS to an existing cluster if you don't want to scale out? Add more RAM to the OSD servers, or add an SSD-backed read/write cache tier?

Kind Regards,
David Majchrzak

On 12 Jun 2014, at 14:42, Mark Nelson <mark.nel...@inktank.com> wrote:

> On 06/12/2014 03:44 AM, David wrote:
>> Hi,
>>
>> We have 5 OSD servers, with 10 OSDs each (journals on enterprise SSDs).
>>
>> We lost an OSD and the cluster started to backfill the data to the rest of
>> the OSDs - during which latency skyrocketed on some OSDs and connected
>> clients experienced massive IO wait.
>>
>> I'm trying to rectify the situation now, and from what I can tell, these are
>> the settings that might help:
>>
>> osd client op priority
>> osd recovery op priority
>> osd max backfills
>> osd recovery max active
>>
>> 1. Does a higher priority value mean higher priority, or does a priority of
>> 1 mean highest priority?
>> 2. I'm running with the defaults for these settings. Does anyone else have
>> experience changing them?
>
> We did some investigation into this a little while back. I suspect you'll
> see some benefit from reducing the backfill/recovery priority and the number
> of concurrent operations, but you have to be careful. We found that the
> higher the number of concurrent client IOs (past the saturation point), the
> greater the relative proportion of throughput used by client IO. That makes
> it hard to nail down specific priority and concurrency settings. If your
> workload requires high throughput and low latency with few client IOs (i.e.
> below the saturation point), you may need to heavily favor client IO. If you
> are over-saturating the cluster with many concurrent IOs, you may want to
> give client IO less priority.
> If you overly favor client IO when over-saturating the cluster, recovery
> can take much, much longer and client throughput may actually be lower in
> aggregate. Obviously this isn't ideal, but seems to be what's going on
> right now.
>
> Mark
>
>>
>> Kind Regards,
>> David Majchrzak

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
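[Editor's note: the settings discussed in this thread can be sketched as a ceph.conf fragment. To the best of my knowledge a higher value means higher priority (clients typically default to 63, recovery to a much lower value), but the exact values below are illustrative starting points, not tuned recommendations, and defaults vary by Ceph release.]

```
# ceph.conf, [osd] section -- illustrative values, verify against your release
[osd]
osd max backfills = 1          # max concurrent backfills per OSD
osd recovery max active = 1    # max concurrent recovery ops per OSD
osd recovery op priority = 1   # lower value = lower priority
osd client op priority = 63    # higher value = higher priority

# The same options can be injected at runtime without restarting OSDs,
# from a node with admin keyring access:
#   ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
```

Runtime injection is useful during an ongoing recovery storm; the ceph.conf entries make the throttling persist across OSD restarts.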