Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-10 Thread Craig Lewis
On 5/7/14 15:33, Dimitri Maziuk wrote: On 05/07/2014 04:11 PM, Craig Lewis wrote: On 5/7/14 13:40, Sergey Malinin wrote: Check dmesg and SMART data on both nodes. This behaviour is similar to a failing hdd. It does sound like a failing disk... but there's nothing in dmesg, and smartmontools

Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-08 Thread Dimitri Maziuk
On 5/7/2014 7:35 PM, Craig Lewis wrote: Because of the very low recovery parameters, there's only a single backfill running. `iostat -dmx 5 5` did report 100% util on the OSD that is backfilling, but I expected that. Once backfilling moves on to a new OSD, the 100% util follows the backfill oper
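
For reference, a minimal sketch of the iostat check discussed here, run on the node hosting the backfilling OSD (the device name below is a placeholder, not one from Craig's cluster):

    # extended per-device stats in MB/s, 5-second intervals, 5 samples;
    # the %util column shows how busy each spindle is
    iostat -dmx 5 5
    # or watch just the suspect OSD's data disk continuously
    iostat -dmx /dev/sdb 5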

Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Craig Lewis
On 5/7/14 15:33, Dimitri Maziuk wrote: On 05/07/2014 04:11 PM, Craig Lewis wrote: On 5/7/14 13:40, Sergey Malinin wrote: Check dmesg and SMART data on both nodes. This behaviour is similar to a failing hdd. It does sound like a failing disk... but there's nothing in dmesg, and smartmontools
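
A sketch of the checks being suggested here, assuming smartmontools is installed and /dev/sdb is the suspect OSD's data disk (a placeholder):

    # kernel-level I/O, SATA, or SCSI complaints
    dmesg | grep -iE 'error|ata[0-9]|sd[a-z]'
    # quick SMART health verdict, then the full attribute and error log
    smartctl -H /dev/sdb
    smartctl -a /dev/sdb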

Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Dimitri Maziuk
On 05/07/2014 04:11 PM, Craig Lewis wrote: On 5/7/14 13:40, Sergey Malinin wrote: Check dmesg and SMART data on both nodes. This behaviour is similar to a failing hdd. It does sound like a failing disk... but there's nothing in dmesg, and smartmontools hasn't emailed me about a

Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Craig Lewis
On 5/7/14 13:40, Sergey Malinin wrote: Check dmesg and SMART data on both nodes. This behaviour is similar to a failing hdd. It does sound like a failing disk... but there's nothing in dmesg, and smartmontools hasn't emailed me about a failing disk. The same thing is happening to more than
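
Since the point here is that smartd never sent a warning, one sanity check is whether smartd is configured to send mail at all; a hedged example of the relevant /etc/smartd.conf line (the address is a placeholder, the config path and service name vary by distro):

    # monitor all disks with the default checks, mail warnings to this address,
    # and send one test mail on startup to prove delivery works
    DEVICESCAN -a -m root@localhost -M test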

Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Sergey Malinin
Check dmesg and SMART data on both nodes. This behaviour is similar to a failing hdd. On Wednesday, May 7, 2014 at 23:28, Craig Lewis wrote: On 5/7/14 13:15, Sergey Malinin wrote: Is there anything unusual in dmesg at osd.5? Nothing in dmesg, but ceph-osd.5.log has plenty. I've att
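
The log Craig refers to lives under the default Ceph log directory on most installs; a minimal sketch of pulling the interesting lines out of it (the path is assumed, not confirmed by the thread):

    # recent errors, faults, and heartbeat complaints from osd.5
    grep -iE 'error|fault|timed out|heartbeat' /var/log/ceph/ceph-osd.5.log | tail -n 50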

Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Sergey Malinin
Is there anything unusual in dmesg at osd.5? On Wednesday, May 7, 2014 at 23:09, Craig Lewis wrote: I already have osd_max_backfill = 1, and osd_recovery_op_priority = 1. osd_recovery_max_active is the default 15, so I'll give that a try... some OSDs timed out during the injectargs.

Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Craig Lewis
I already have osd_max_backfill = 1, and osd_recovery_op_priority = 1. osd_recovery_max_active is the default 15, so I'll give that a try... some OSDs timed out during the injectargs. I added it to ceph.conf, and restarted them all. I was running RadosGW-Agent, but it's down now. I disable
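
A sketch of the ceph.conf stanza implied here, using the canonical option names (the values are the conservative ones discussed in this thread, not a copy of Craig's actual file):

    [osd]
        osd max backfills = 1
        osd recovery max active = 1
        osd recovery op priority = 1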

Re: [ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Mike Dawson
Craig, I suspect the disks in question are seeking constantly and the spindle contention is causing significant latency. A strategy of throttling backfill/recovery and reducing client traffic tends to work for me. 1) You should make sure recovery and backfill are throttled: ceph tell osd.* in
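
A hedged expansion of the throttling command Mike's advice points at, applied live to every running OSD without a restart (values as discussed in this thread; tune them for the cluster at hand):

    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'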

[ceph-users] 16 osds: 11 up, 16 in

2014-05-07 Thread Craig Lewis
The 5 OSDs that are down have all been kicked out for being unresponsive, and they are getting kicked out faster than they can complete the recovery+backfill. The number of degraded PGs is growing over time.

root@ceph0c:~# ceph -w
    cluster 1604ec7a-6ceb-42fc-8c68-0a7896c4e120
     health HE
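
For readers following along, the usual commands for seeing which OSDs are down and why the cluster is degraded, run from any node with admin credentials:

    # the specific warnings behind the health summary
    ceph health detail
    # which OSDs are up/down and in/out, grouped by host
    ceph osd tree
    # stream cluster events and recovery progress
    ceph -w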