Having 9 ceph-mon servers doesn't help...

I would look at the stuck PGs to find the OSDs they have in common and focus 
on those.
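
Something along these lines should list the stuck PGs and the OSDs they map to 
(untested against your cluster, and the output columns vary a bit between 
releases, so adjust as needed; <pgid> is a placeholder):

    # list the PGs that are stuck, with their up/acting OSD sets
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # for a PG from that list, show which OSDs it maps to and why it is stuck
    ceph pg map <pgid>
    ceph pg <pgid> query    # check the "recovery_state" / peering section

If the same few OSDs keep turning up across most of the stuck PGs, those are 
the ones to look at first.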

Their logs will probably have details on where the problem is.
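
On a default install the OSD logs live under /var/log/ceph on each storage 
node, so something like this (paths assume the default log location, <id> is a 
placeholder):

    # find which host a suspect OSD lives on
    ceph osd find <id>

    # then on that node
    less /var/log/ceph/ceph-osd.<id>.log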

> On Jun 2, 2017, at 07:51, Grant Morley <grantmorley1...@gmail.com> wrote:
> 
> Hi All,
> 
> I wonder if anyone could help at all.
> 
> We were doing some routine maintenance on our ceph cluster and after running 
> a "service ceph-all restart" on one of our nodes we noticed that something 
> wasn't quite right. The cluster has gone into an error state, we have 
> multiple stuck PGs, and object recovery is taking a strangely long time. At 
> first about 46% of objects were misplaced and we are now down to roughly 16%.
> 
> However it has taken about 36 hours to do the recovery so far, and with a 
> possible 16% to go we are looking at a fairly major issue. As a lot of the 
> system is now blocked for reads/writes, customers cannot access their VMs.
> 
> I think the main issue at the moment is that we have 210 PGs stuck inactive 
> and nothing we seem to do can get them to peer.
> 
> Below is an output of the ceph status. Can anyone help or have any ideas on 
> how to speed up the recovery process? We have tried turning down logging on 
> the OSDs, but some are going so slowly they won't allow us to injectargs into 
> them.
> 
> health HEALTH_ERR
>             210 pgs are stuck inactive for more than 300 seconds
>             298 pgs backfill_wait
>             3 pgs backfilling
>             1 pgs degraded
>             200 pgs peering
>             1 pgs recovery_wait
>             1 pgs stuck degraded
>             210 pgs stuck inactive
>             512 pgs stuck unclean
>             3310 requests are blocked > 32 sec
>             recovery 2/11094405 objects degraded (0.000%)
>             recovery 1785063/11094405 objects misplaced (16.090%)
>             nodown,noout,noscrub,nodeep-scrub flag(s) set
> 
>             election epoch 16314, quorum 0,1,2,3,4,5,6,7,8 
> storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
>      osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
>             flags nodown,noout,noscrub,nodeep-scrub
>       pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
>             43356 GB used, 47141 GB / 90498 GB avail
>             2/11094405 objects degraded (0.000%)
>             1785063/11094405 objects misplaced (16.090%)
>                 1524 active+clean
>                  298 active+remapped+wait_backfill
>                  153 peering
>                   47 remapped+peering
>                   10 inactive
>                    3 active+remapped+backfilling
>                    1 active+recovery_wait+degraded+remapped
> 
> Many thanks,
> 
> Grant
> 
> 
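On the injectargs point: when an OSD is too bogged down to answer a "ceph tell" 
routed through the monitors, the admin socket on the OSD's own host usually 
still responds, so you can drop the debug levels locally. Roughly (untested 
here, socket path assumes the default location, <id> is a placeholder):

    # via the monitors (may hang on very slow OSDs)
    ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'

    # locally on the OSD host, through the admin socket
    ceph daemon osd.<id> config set debug_osd 0/0
    ceph daemon osd.<id> config set debug_ms 0/0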

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
