Hi Kevin,

Unfortunately, restarting the OSDs doesn't appear to help; instead it seems to make things worse, with PGs getting stuck degraded.
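For reference, the checks I am basing this on are roughly the following (the PG id 1.2f3 is only a placeholder for one of the stuck PGs):

    # overall state and blocked-request summary
    ceph -s
    ceph health detail

    # list the PGs that are stuck degraded or inactive
    ceph pg dump_stuck degraded
    ceph pg dump_stuck inactive

    # query one of the stuck PGs to see the peering/remapped loop
    ceph pg 1.2f3 query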
Best regards
/Magnus

2018-07-11 20:46 GMT+02:00 Kevin Olbrich <k...@sv01.de>:
> Sounds a little bit like the problem I had on OSDs:
>
> [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
> <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026680.html>
> (Kevin Olbrich)
>
> The rest of that thread, same subject:
>   - Burkhard Linke: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026681.html>
>   - Kevin Olbrich: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026682.html>
>   - Kevin Olbrich: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026683.html>
>   - Kevin Olbrich: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026685.html>
>   - Kevin Olbrich: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026689.html>
>   - Paul Emmerich: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026692.html>
>   - Kevin Olbrich: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026695.html>
>
> I ended up restarting the OSDs which were stuck in that state and they
> immediately fixed themselves.
> It should also work to just mark the problem OSDs "out" and immediately
> bring them back in again to fix it.
> [a command sketch of this workaround follows after the quoted status below]
>
> - Kevin
>
> 2018-07-11 20:30 GMT+02:00 Magnus Grönlund <mag...@gronlund.se>:
>
>> Hi,
>>
>> Started to upgrade a Ceph cluster from Jewel (10.2.10) to Luminous (12.2.6).
>>
>> After upgrading and restarting the mons everything looked OK: the mons
>> had quorum, all OSDs were up and in, and all the PGs were active+clean.
>> But before I had time to start upgrading the OSDs it became obvious that
>> something had gone terribly wrong.
>> All of a sudden 1600 out of 4100 PGs were inactive and 40% of the data
>> was misplaced!
>>
>> The mons appear OK and all OSDs are still up and in, but a few hours
>> later there were still 1483 PGs stuck inactive, essentially all of them
>> peering!
>> Investigating one of the stuck PGs, it appears to be looping between
>> "inactive", "remapped+peering" and "peering", and the epoch number is
>> rising fast; see the attached pg query outputs.
>>
>> We really can't afford to lose the cluster or the data, so any help or
>> suggestions on how to debug or fix this issue would be very, very
>> appreciated!
>>
>>   health: HEALTH_ERR
>>           1483 pgs are stuck inactive for more than 60 seconds
>>           542 pgs backfill_wait
>>           14 pgs backfilling
>>           11 pgs degraded
>>           1402 pgs peering
>>           3 pgs recovery_wait
>>           11 pgs stuck degraded
>>           1483 pgs stuck inactive
>>           2042 pgs stuck unclean
>>           7 pgs stuck undersized
>>           7 pgs undersized
>>           111 requests are blocked > 32 sec
>>           10586 requests are blocked > 4096 sec
>>           recovery 9472/11120724 objects degraded (0.085%)
>>           recovery 1181567/11120724 objects misplaced (10.625%)
>>           noout flag(s) set
>>           mon.eselde02u32 low disk space
>>
>>   services:
>>     mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34
>>     mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34
>>     osd: 111 osds: 111 up, 111 in; 800 remapped pgs
>>          flags noout
>>
>>   data:
>>     pools:   18 pools, 4104 pgs
>>     objects: 3620k objects, 13875 GB
>>     usage:   42254 GB used, 160 TB / 201 TB avail
>>     pgs:     1.876% pgs unknown
>>              34.259% pgs not active
>>              9472/11120724 objects degraded (0.085%)
>>              1181567/11120724 objects misplaced (10.625%)
>>              2062 active+clean
>>              1221 peering
>>              535  active+remapped+backfill_wait
>>              181  remapped+peering
>>              77   unknown
>>              13   active+remapped+backfilling
>>              7    active+undersized+degraded+remapped+backfill_wait
>>              4    remapped
>>              3    active+recovery_wait+degraded+remapped
>>              1    active+degraded+remapped+backfilling
>>
>>   io:
>>     recovery: 298 MB/s, 77 objects/s
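For completeness, the restart / out-and-in workaround Kevin describes would look roughly like this on a systemd-managed node (the OSD id 12 is only a placeholder; pick the OSDs that are actually stuck):

    # restart a stuck OSD on the node that hosts it
    systemctl restart ceph-osd@12

    # or mark it out and bring it straight back in to force re-peering
    ceph osd out 12
    ceph osd in 12

    # then check whether the affected PGs leave the peering state
    ceph pg dump_stuck inactive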
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com