Usually 'ceph health detail' gives better info on what is making everything stuck.
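For example, to see which PGs are stuck and why (the PG id in the last
command is just a placeholder; substitute one from the dump):

    # summarise every current health problem, per PG
    ceph health detail

    # list only the PGs that are stuck inactive
    ceph pg dump_stuck inactive

    # ask one stuck PG why it cannot peer (placeholder id)
    ceph pg 3.1f query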
Saverio

2017-06-02 13:51 GMT+02:00 Grant Morley <grantmorley1...@gmail.com>:
> Hi All,
>
> I wonder if anyone could help at all.
>
> We were doing some routine maintenance on our ceph cluster, and after
> running a "service ceph-all restart" on one of our nodes we noticed that
> something wasn't quite right. The cluster has gone into an error state,
> we have multiple stuck PGs, and the object replacement recovery is taking
> a strangely long time. At first about 46% of objects were misplaced; we
> are now down to roughly 16%.
>
> However, it has taken about 36 hours to get this far, and with a possible
> 16% still to go we are looking at a fairly major issue. A lot of the
> system is now blocked for reads/writes, so customers cannot access their
> VMs.
>
> I think the main issue at the moment is that we have 210 PGs stuck
> inactive, and nothing we do seems to get them to peer.
>
> Below is the output of ceph status. Can anyone help, or have any ideas on
> how to speed up the recovery process? We have tried turning down logging
> on the OSDs, but some are running so slowly that they won't let us
> injectargs into them.
>
>      health HEALTH_ERR
>             210 pgs are stuck inactive for more than 300 seconds
>             298 pgs backfill_wait
>             3 pgs backfilling
>             1 pgs degraded
>             200 pgs peering
>             1 pgs recovery_wait
>             1 pgs stuck degraded
>             210 pgs stuck inactive
>             512 pgs stuck unclean
>             3310 requests are blocked > 32 sec
>             recovery 2/11094405 objects degraded (0.000%)
>             recovery 1785063/11094405 objects misplaced (16.090%)
>             nodown,noout,noscrub,nodeep-scrub flag(s) set
>      election epoch 16314, quorum 0,1,2,3,4,5,6,7,8
>             storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
>      osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
>             flags nodown,noout,noscrub,nodeep-scrub
>       pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
>             43356 GB used, 47141 GB / 90498 GB avail
>             2/11094405 objects degraded (0.000%)
>             1785063/11094405 objects misplaced (16.090%)
>                 1524 active+clean
>                  298 active+remapped+wait_backfill
>                  153 peering
>                   47 remapped+peering
>                   10 inactive
>                    3 active+remapped+backfilling
>                    1 active+recovery_wait+degraded+remapped
>
> Many thanks,
>
> Grant
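On the injectargs problem: if an OSD is too slow to answer injectargs over
the cluster network, its local admin socket usually still responds. A rough
sketch, run on the host that carries the OSD (osd.12 is a placeholder id):

    # turn OSD debug logging all the way down, without going
    # through the monitors
    ceph daemon osd.12 config set debug_osd 0/0

    # optionally allow more concurrent backfills per OSD, which can
    # speed up moving the misplaced objects at the cost of client I/O
    ceph daemon osd.12 config set osd_max_backfills 2

Note that this only helps the backfilling PGs; the 200-odd PGs stuck in
peering are a separate problem, and 'ceph pg <pgid> query' on one of them
should say what they are waiting for.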