I would start troubleshooting by restarting those OSDs that don't respond to
ceph pg XXXX query.
Check the log files of those OSDs when you restart.
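For example, something along these lines; osd.24 is just an id borrowed from
later in the thread, and the exact service command depends on whether the host
uses sysvinit or systemd:

    # on the host that owns the unresponsive OSD
    service ceph restart osd.24               # sysvinit, matching the "service ceph-all restart" used here
    # systemctl restart ceph-osd@24           # systemd-based hosts instead
    tail -f /var/log/ceph/ceph-osd.24.log     # watch the startup, peering and heartbeat messages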
Saverio

2017-06-02 14:42 GMT+02:00 Grant Morley <grantmorley1...@gmail.com>:
> We were just putting on a security patch that got released.
>
> After that, on one of the nodes we ran a "service ceph-all restart" and the
> cluster went into error status.
>
> What we have been doing to query the PGs so far is:
>
> We get a list of the PGs from the output of ceph health detail,
> or we have used: ceph pg dump_stuck inactive
>
> We then have a list of about 200 PGs and they are either peering,
> remapped+peering or inactive.
>
> When we query those we use: ceph pg PGNUM query
>
> Sometimes this replies with something like:
>
> https://pastebin.com/m3bH34RB
>
> Or we get a huge output that seems to be a sequence of peering and then
> remapping, the end of which looks like this:
>
> https://pastebin.com/EwW5rftq
>
> Or we get no reply at all and it just hangs forever.
>
> So, we have OSDs that hang forever when contacted directly using: ceph pg
> XXXX query. If we look at the OSDs then we also have OSDs that are not
> responding to queries such as: ceph tell osd.24 version - which also hangs
> forever. If we restart the OSD service it can reply, and then it hangs
> again forever.
>
> We may have more than one problem. One is OSDs hanging on queries as
> simple as: ceph tell osd.XX version
>
> What causes that?
>
> The other is the PGs that are not peering correctly. The NICs are all
> configured correctly, we tested the network connection and it is working
> and the ports are open, but the peering process is not working between the
> OSDs for some PGs and we have been unable to unstick it.
>
> Thanks,
>
> On Fri, Jun 2, 2017 at 1:18 PM, Tyanko Aleksiev <tyanko.alex...@gmail.com> wrote:
>>
>> Additionally, it could be useful to know what you did during the
>> maintenance.
>>
>> Cheers,
>> Tyanko
>>
>> On 2 June 2017 at 14:08, Saverio Proto <ziopr...@gmail.com> wrote:
>>>
>>> To give you some help you need to tell us the Ceph version you are
>>> using and, from the [osd] section of ceph.conf, what values you have
>>> for the following:
>>>
>>> [osd]
>>> osd max backfills
>>> osd recovery max active
>>> osd recovery op priority
>>>
>>> These three settings can influence the recovery speed.
>>>
>>> Also, do you have big enough limits?
>>>
>>> Check on any host the content of: /proc/`pid_of_the_osd`/limits
>>>
>>> Saverio
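For reference, one way to check both of those on a live OSD host might look
like the following; osd.0 and the single-pid lookup are only placeholders to
adapt to each host:

    # effective recovery settings of a running OSD, read over its admin socket
    ceph daemon osd.0 config show | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'

    # limits the OSD process is actually running with (pick the right pid on a multi-OSD host)
    cat /proc/$(pidof ceph-osd | awk '{print $1}')/limits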
>>>
>>> 2017-06-02 14:00 GMT+02:00 Grant Morley <grantmorley1...@gmail.com>:
>>> > HEALTH_ERR 210 pgs are stuck inactive for more than 300 seconds;
>>> > 296 pgs backfill_wait; 3 pgs backfilling; 1 pgs degraded; 202 pgs peering;
>>> > 1 pgs recovery_wait; 1 pgs stuck degraded; 210 pgs stuck inactive;
>>> > 510 pgs stuck unclean; 3308 requests are blocked > 32 sec;
>>> > 41 osds have slow requests; recovery 2/11091408 objects degraded (0.000%);
>>> > recovery 1778127/11091408 objects misplaced (16.032%);
>>> > nodown,noout,noscrub,nodeep-scrub flag(s) set
>>> >
>>> > pg 3.235 is stuck inactive for 138232.508429, current state peering, last acting [11,26,1]
>>> > pg 1.237 is stuck inactive for 138260.482588, current state peering, last acting [8,41,34]
>>> > pg 2.231 is stuck inactive for 138258.316031, current state peering, last acting [24,53,8]
>>> > pg 2.22e is stuck inactive for 194033.321591, current state remapped+peering, last acting [0,29,1]
>>> > pg 1.22c is stuck inactive for 102514.200154, current state peering, last acting [51,7,20]
>>> > pg 2.228 is stuck inactive for 138258.317797, current state peering, last acting [53,4,34]
>>> > pg 1.227 is stuck inactive for 138258.244681, current state remapped+peering, last acting [48,35,11]
>>> > pg 2.220 is stuck inactive for 193940.066322, current state remapped+peering, last acting [9,39,8]
>>> > pg 1.222 is stuck inactive for 101474.087688, current state peering, last acting [23,11,35]
>>> > pg 3.130 is stuck inactive for 99735.451290, current state peering, last acting [27,37,17]
>>> > pg 3.136 is stuck inactive for 138221.552865, current state peering, last acting [26,49,10]
>>> > pg 3.13c is stuck inactive for 137563.906503, current state peering, last acting [51,53,7]
>>> > pg 2.142 is stuck inactive for 99962.462932, current state peering, last acting [37,16,34]
>>> > pg 1.141 is stuck inactive for 138257.572476, current state remapped+peering, last acting [5,17,49]
>>> > pg 2.141 is stuck inactive for 102567.745720, current state peering, last acting [36,7,15]
>>> > pg 3.144 is stuck inactive for 138218.289585, current state remapped+peering, last acting [18,28,16]
>>> > pg 1.14d is stuck inactive for 138260.030530, current state peering, last acting [46,43,17]
>>> > pg 3.155 is stuck inactive for 138227.368541, current state remapped+peering, last acting [33,20,52]
>>> > pg 2.8d is stuck inactive for 100251.802576, current state peering, last acting [6,39,27]
>>> > pg 2.15c is stuck inactive for 102567.512279, current state remapped+peering, last acting [7,35,49]
>>> > pg 2.167 is stuck inactive for 138260.093367, current state peering, last acting [35,23,17]
>>> > pg 3.9d is stuck inactive for 117050.294600, current state peering, last acting [12,51,23]
>>> > pg 2.16e is stuck inactive for 99846.214239, current state peering, last acting [25,5,8]
>>> > pg 2.17b is stuck inactive for 99733.504794, current state peering, last acting [49,27,14]
>>> > pg 3.178 is stuck inactive for 99973.600671, current state peering, last acting [29,16,40]
>>> > pg 3.240 is stuck inactive for 28768.488851, current state remapped+peering, last acting [33,8,32]
>>> > pg 3.b6 is stuck inactive for 138222.461160, current state peering, last acting [26,29,34]
>>> > pg 2.17e is stuck inactive for 159229.154401, current state peering, last acting [13,42,48]
>>> > pg 2.17c is stuck inactive for 104921.767401, current state remapped+peering, last acting [23,12,24]
>>> > pg 3.17d is stuck inactive for 137563.979966, current state remapped+peering, last acting [43,24,29]
>>> > pg 1.24b is stuck inactive for 93144.933177, current state peering, last acting [43,20,37]
>>> > pg 1.bd is stuck inactive for 102616.793475, current state peering, last acting [16,30,35]
>>> > pg 3.1d6 is stuck inactive for 99974.485247, current state peering, last acting [16,38,29]
>>> > pg 2.172 is stuck inactive for 193919.627310, current state inactive, last acting [39,21,10]
>>> > pg 1.171 is stuck inactive for 104947.558748, current state peering, last acting [49,9,25]
>>> > pg 1.243 is stuck inactive for 208452.393430, current state peering, last acting [45,32,24]
>>> > pg 3.aa is stuck inactive for 104958.230601, current state remapped+peering, last acting [51,12,13]
>>> >
>>> > 41 osds have slow requests
>>> > recovery 2/11091408 objects degraded (0.000%)
>>> > recovery 1778127/11091408 objects misplaced (16.032%)
>>> > nodown,noout,noscrub,nodeep-scrub flag(s) set
>>> >
>>> > That is what we seem to be getting a lot of. It appears the PGs are just
>>> > stuck as inactive. I am not sure how to get around that.
>>> >
>>> > Thanks,
>>> >
>>> > On Fri, Jun 2, 2017 at 12:55 PM, Saverio Proto <ziopr...@gmail.com> wrote:
>>> >>
>>> >> Usually 'ceph health detail' gives better info on what is making
>>> >> everything stuck.
>>> >>
>>> >> Saverio
>>> >>
>>> >> 2017-06-02 13:51 GMT+02:00 Grant Morley <grantmorley1...@gmail.com>:
>>> >> > Hi All,
>>> >> >
>>> >> > I wonder if anyone could help at all.
>>> >> >
>>> >> > We were doing some routine maintenance on our Ceph cluster and, after
>>> >> > running a "service ceph-all restart" on one of our nodes, we noticed
>>> >> > that something wasn't quite right. The cluster has gone into an error
>>> >> > state, we have multiple stuck PGs, and the object replacement recovery
>>> >> > is taking a strangely long time. At first there were about 46% of
>>> >> > objects misplaced and we now have roughly 16%.
>>> >> >
>>> >> > However, it has taken about 36 hours to do the recovery so far and,
>>> >> > with a possible 16% still to go, we are looking at a fairly major
>>> >> > issue. As a lot of the system is now blocked for reads/writes,
>>> >> > customers cannot access their VMs.
>>> >> >
>>> >> > I think the main issue at the moment is that we have 210 PGs stuck
>>> >> > inactive and nothing we seem to do can get them to peer.
>>> >> >
>>> >> > Below is an output of the ceph status. Can anyone help or have any
>>> >> > ideas on how to speed up the recovery process? We have tried turning
>>> >> > down logging on the OSDs but some are going so slow they won't allow
>>> >> > us to injectargs into them.
>>> >> >
>>> >> >     health HEALTH_ERR
>>> >> >            210 pgs are stuck inactive for more than 300 seconds
>>> >> >            298 pgs backfill_wait
>>> >> >            3 pgs backfilling
>>> >> >            1 pgs degraded
>>> >> >            200 pgs peering
>>> >> >            1 pgs recovery_wait
>>> >> >            1 pgs stuck degraded
>>> >> >            210 pgs stuck inactive
>>> >> >            512 pgs stuck unclean
>>> >> >            3310 requests are blocked > 32 sec
>>> >> >            recovery 2/11094405 objects degraded (0.000%)
>>> >> >            recovery 1785063/11094405 objects misplaced (16.090%)
>>> >> >            nodown,noout,noscrub,nodeep-scrub flag(s) set
>>> >> >     election epoch 16314, quorum 0,1,2,3,4,5,6,7,8
>>> >> >            storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
>>> >> >     osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
>>> >> >            flags nodown,noout,noscrub,nodeep-scrub
>>> >> >     pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
>>> >> >            43356 GB used, 47141 GB / 90498 GB avail
>>> >> >            2/11094405 objects degraded (0.000%)
>>> >> >            1785063/11094405 objects misplaced (16.090%)
>>> >> >                1524 active+clean
>>> >> >                 298 active+remapped+wait_backfill
>>> >> >                 153 peering
>>> >> >                  47 remapped+peering
>>> >> >                  10 inactive
>>> >> >                   3 active+remapped+backfilling
>>> >> >                   1 active+recovery_wait+degraded+remapped
>>> >> >
>>> >> > Many thanks,
>>> >> >
>>> >> > Grant
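On the injectargs attempts mentioned above, the usual form is something like
the commands below; the option names and values are only examples and vary
between Ceph releases. Note that ceph tell still has to reach each OSD daemon,
which would explain why it hangs when the OSDs themselves are unresponsive.

    # quieten OSD logging cluster-wide
    ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'

    # throttle recovery/backfill so client I/O is not starved
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'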