Re: [ceph-users] slow OSD brings down the cluster

2014-08-07 Thread Luis Periquito
Hi Mark, I've been playing with the reweight on 3 of the OSDs (BTW each OSD is backed by an HDD, with an SSD backing all the 4 journals on each host) and these slower ones were given a reweight of 0.5, 0.66 and 0.66. From what I gathered the reweight would also reduce the number of I/O directed at

Re: [ceph-users] slow OSD brings down the cluster

2014-08-06 Thread Mark Nelson
On 08/06/2014 03:43 AM, Luis Periquito wrote: Hi, In the last few days I've had some issues with the radosgw in which all requests would just stop being served. After some investigation I would go for a single slow OSD. I just restarted that OSD and everything would just go back to work. Every

Re: [ceph-users] slow OSD brings down the cluster

2014-08-06 Thread Sage Weil
You can use the ceph osd perf command to get recent queue latency stats for all OSDs. With a bit of sorting this should quickly tell you if any OSDs are going significantly slower than the others. We'd like to automate this in calamari or perhaps even in the monitor, but it is not immediate
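
The "bit of sorting" mentioned above could look roughly like the following. This is only a sketch: it assumes the plain-text output of ceph osd perf is a header line followed by three whitespace-separated integer columns (OSD id, commit latency in ms, apply latency in ms), which may differ between releases. The function names, the 3x-median threshold and the sample numbers are made up for illustration and do not come from the thread.

```python
import statistics


def parse_osd_perf(text):
    """Parse a plain-text `ceph osd perf` table (assumed three-column layout).

    Returns a list of (osd_id, commit_ms, apply_ms) tuples.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and any line that is not three integer columns.
        if len(fields) != 3 or not all(f.isdigit() for f in fields):
            continue
        osd_id, commit_ms, apply_ms = map(int, fields)
        rows.append((osd_id, commit_ms, apply_ms))
    return rows


def slow_osds(rows, factor=3.0):
    """Return OSDs whose apply latency exceeds factor x the median, slowest first.

    The factor is an arbitrary illustrative threshold, not a Ceph default.
    """
    if not rows:
        return []
    median = statistics.median(r[2] for r in rows)
    # Guard against a zero median on an idle cluster.
    threshold = factor * max(median, 1)
    slow = [r for r in rows if r[2] > threshold]
    return sorted(slow, key=lambda r: r[2], reverse=True)


# Hypothetical sample output; the values are invented for this example.
SAMPLE = """\
osdid fs_commit_latency(ms) fs_apply_latency(ms)
    0                     2                    5
    1                     3                    6
    2                    41                  350
    3                     2                    4
"""

if __name__ == "__main__":
    for osd_id, commit_ms, apply_ms in slow_osds(parse_osd_perf(SAMPLE)):
        print(f"osd.{osd_id}: commit={commit_ms}ms apply={apply_ms}ms")
```

On a live cluster the text could be obtained with subprocess.run(["ceph", "osd", "perf"], capture_output=True, text=True).stdout and passed to parse_osd_perf. Comparing against the median rather than the mean keeps one very slow OSD from dragging the baseline up with it.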

Re: [ceph-users] slow OSD brings down the cluster

2014-08-06 Thread Luis Periquito
Hi Wido, as the backing disk is running a deep scrub it's constantly 100% busy, no errors though... I'm running everything on XFS. I had a similar feeling that was the OSD slowing down those requests. What would be the affected pool? ".rgw"? thanks, On 6 August 2014 10:08, Wido den Hollander

Re: [ceph-users] slow OSD brings down the cluster

2014-08-06 Thread Wido den Hollander
On 08/06/2014 10:43 AM, Luis Periquito wrote: Hi, In the last few days I've had some issues with the radosgw in which all requests would just stop being served. After some investigation I would go for a single slow OSD. I just restarted that OSD and everything would just go back to work. Every

[ceph-users] slow OSD brings down the cluster

2014-08-06 Thread Luis Periquito
Hi, In the last few days I've had some issues with the radosgw in which all requests would just stop being served. After some investigation I would go for a single slow OSD. I just restarted that OSD and everything would just go back to work. Every single time there was a deep scrub running on th