This is probably a result of a limitation CRUSH has when a pool's replica
size equals the total number of buckets it can choose from -- in your case,
a size of 4 across 4 hosts, so CRUSH has to pick every host for each PG and
can fail to find a complete mapping after a failure. We made changes to the
algorithm earlier this year to deal with this, but kernel clients need to be
very new to be compatible, so we haven't enabled them by default yet -- see
the documentation on "crush tunables".
-Greg

On Friday, November 8, 2013, Niklas Goerke wrote:

> Hi guys
>
> This is probably a configuration error, but I just can't find it.
> The following reproducibly happens on my cluster [1].
>
> 15:52:15 On Host1 one disk is removed via the RAID controller (to
> ceph it looks as if the disk died)
> 15:52:52 OSD reported missing (osd.47)
> 15:52:53 osdmap eXXX: 60 osds: 59 up, 60 in; 1.781% degraded, 436 PGs
> stuck unclean, 436 PGs degraded; not recovering yet
> 15:57:54 osdmap eXXX: 60 osds: 59 up, 59 in; recovery starts
> 15:58:00 2.502% degraded
> 15:58:01 3.413% degraded; recovering at about 1GB/s --> recovery speed
> decreasing to about 40MB/s
> 17:02:10 10 PGs active+remapped, 218 PGs active+degraded, 0.898% degraded,
> recovery stopped
> 18:12 Still not recovering
> A few days later: OSD removed [2], now recovering completely
>
> I would like my cluster to recover completely without my intervention. Can
> anyone give an educated guess as to what went wrong here? I can't find any
> reason why the cluster would simply stop recovering.
>
> Thank you for any hints!
> Niklas
>
>
> [1] 4 OSD Hosts with 15 disks each. On each of the 60 identical disks
> there is one OSD. I have one large pool with 6000 PGs and a replica size of
> 4, and 3 (default) pools with 64 PGs each
> [2] http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
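> For reference, the manual removal in [2] boils down to roughly this
> sequence for osd.47 -- a sketch only, see the linked page for the full
> steps (including stopping the daemon before removing it from CRUSH):
>
>   ceph osd out 47              # mark the OSD out so data rebalances off it
>   # stop the ceph-osd daemon on its host, then:
>   ceph osd crush remove osd.47 # remove it from the CRUSH map
>   ceph auth del osd.47         # delete its authentication key
>   ceph osd rm 47               # remove the OSD from the cluster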


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
