Hi Kevin,

Unfortunately restarting the OSDs doesn't appear to help; if anything it seems
to make things worse, with PGs getting stuck degraded.
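
(For reference, the stuck PGs can be listed with something like the
following:)

    ceph health detail            # shows which PGs are degraded/stuck and why
    ceph pg dump_stuck degraded   # list the PGs currently stuck degraded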

Best regards
/Magnus

2018-07-11 20:46 GMT+02:00 Kevin Olbrich <k...@sv01.de>:

> Sounds a little bit like the problem I had on OSDs:
>
> [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026680.html
>
> Follow-ups in that thread:
>
> - Burkhard Linke: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026681.html
> - Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026682.html
> - Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026683.html
> - Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026685.html
> - Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026689.html
> - Paul Emmerich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026692.html
> - Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026695.html
>
> I ended up restarting the OSDs which were stuck in that state and they
> immediately fixed themselves.
> It should also work to simply mark the problem OSDs "out" and immediately
> mark them back "in" again to fix it.
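>
> A rough sketch of what that looks like on a systemd host (<id> stands in
> for the number of a stuck OSD):
>
>     systemctl restart ceph-osd@<id>   # restart the stuck OSD daemon
>
> or, without restarting the daemon:
>
>     ceph osd out <id>   # mark the OSD out to re-trigger peering
>     ceph osd in <id>    # and mark it back in again right away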
>
> - Kevin
>
> 2018-07-11 20:30 GMT+02:00 Magnus Grönlund <mag...@gronlund.se>:
>
>> Hi,
>>
>> I started to upgrade a Ceph cluster from Jewel (10.2.10) to Luminous
>> (12.2.6).
>>
>> After upgrading and restarting the mons everything looked OK: the mons
>> had quorum, all OSDs were up and in, and all the PGs were active+clean.
>> But before I had time to start upgrading the OSDs it became obvious that
>> something had gone terribly wrong.
>> All of a sudden 1600 out of 4100 PGs were inactive and 40% of the data
>> was misplaced!
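>>
>> (As a rough sketch, this is how the mon versions were checked after the
>> restart, using one of the mon names below as an example:)
>>
>>     ceph versions                      # per-daemon version summary (new in Luminous)
>>     ceph tell mon.eselde02u32 version  # ask one mon directly for its version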
>>
>> The mons appear OK and all OSDs are still up and in, but a few hours
>> later there were still 1483 PGs stuck inactive, essentially all of them
>> in peering!
>> Investigating one of the stuck PGs, it appears to be looping between
>> “inactive”, “remapped+peering” and “peering”, and the epoch number is
>> rising fast; see the attached pg query outputs.
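>>
>> (For reference, the attached outputs were gathered along these lines,
>> with <pgid> standing in for one of the stuck PGs:)
>>
>>     ceph pg dump_stuck inactive   # list the PGs stuck inactive
>>     ceph pg <pgid> query          # full peering state and history for one PG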
>>
>> We really can’t afford to lose the cluster or the data, so any help or
>> suggestions on how to debug or fix this issue would be very, very much
>> appreciated!
>>
>>
>>     health: HEALTH_ERR
>>             1483 pgs are stuck inactive for more than 60 seconds
>>             542 pgs backfill_wait
>>             14 pgs backfilling
>>             11 pgs degraded
>>             1402 pgs peering
>>             3 pgs recovery_wait
>>             11 pgs stuck degraded
>>             1483 pgs stuck inactive
>>             2042 pgs stuck unclean
>>             7 pgs stuck undersized
>>             7 pgs undersized
>>             111 requests are blocked > 32 sec
>>             10586 requests are blocked > 4096 sec
>>             recovery 9472/11120724 objects degraded (0.085%)
>>             recovery 1181567/11120724 objects misplaced (10.625%)
>>             noout flag(s) set
>>             mon.eselde02u32 low disk space
>>
>>   services:
>>     mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34
>>     mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34
>>     osd: 111 osds: 111 up, 111 in; 800 remapped pgs
>>          flags noout
>>
>>   data:
>>     pools:   18 pools, 4104 pgs
>>     objects: 3620k objects, 13875 GB
>>     usage:   42254 GB used, 160 TB / 201 TB avail
>>     pgs:     1.876% pgs unknown
>>              34.259% pgs not active
>>              9472/11120724 objects degraded (0.085%)
>>              1181567/11120724 objects misplaced (10.625%)
>>              2062 active+clean
>>              1221 peering
>>              535  active+remapped+backfill_wait
>>              181  remapped+peering
>>              77   unknown
>>              13   active+remapped+backfilling
>>              7    active+undersized+degraded+remapped+backfill_wait
>>              4    remapped
>>              3    active+recovery_wait+degraded+remapped
>>              1    active+degraded+remapped+backfilling
>>
>>   io:
>>     recovery: 298 MB/s, 77 objects/s
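>>
>> (The status above is the output of "ceph -s"; as a rough sketch, the
>> blocked requests can be narrowed down per OSD like this, with <id> a
>> placeholder for an OSD number:)
>>
>>     ceph health detail                       # should show which OSDs carry the blocked requests
>>     ceph daemon osd.<id> dump_ops_in_flight  # run on that OSD's host: shows its slow/in-flight ops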
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
