I would start troubleshooting by restarting those OSDs that don't respond to
ceph pg XXXX query.
Check the log files of those OSDs when you restart.
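For example, something along these lines; osd.24 is just an id borrowed from
later in the thread, and the exact service command depends on whether the host
uses sysvinit or systemd:

    # on the host that owns the unresponsive OSD
    service ceph restart osd.24               # sysvinit, matching the "service ceph-all restart" used here
    # systemctl restart ceph-osd@24           # systemd-based hosts instead
    tail -f /var/log/ceph/ceph-osd.24.log     # watch the startup, peering and heartbeat messages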
Saverio

2017-06-02 14:42 GMT+02:00 Grant Morley <grantmorley1...@gmail.com>:
> We were just putting on a security patch that got released.
>
> After that, on one of the nodes we ran a "service ceph-all restart" and the
> cluster went into error status.
>
> What we have been doing to query the PGs so far is:
>
> We get a list of the PGs from the output of ceph health detail,
> or we have used: ceph pg dump_stuck inactive
>
> We then have a list of about 200 PGs and they are either peering,
> remapped+peering or inactive.
>
> When we query those we use: ceph pg PGNUM query
>
> Sometimes this replies with something like:
>
> https://pastebin.com/m3bH34RB
>
> Or we get a huge output that seems to be a sequence of peering and then
> remapping, the end of which looks like this:
>
> https://pastebin.com/EwW5rftq
>
> Or we get no reply at all and it just hangs forever.
>
> So, we have OSDs that hang forever when contacted directly using: ceph pg
> XXXX query. If we look at the OSDs then we also have OSDs that are not
> responding to queries such as: ceph tell osd.24 version - which also hangs
> forever. If we restart the OSD service it can reply, and then it hangs
> again forever.
>
> We may have more than one problem. One is OSDs hanging on queries as
> simple as: ceph tell osd.XX version
>
> What causes that?
>
> The other is the PGs that are not peering correctly. The NICs are all
> configured correctly, we tested the network connection and it is working
> and the ports are open, but the peering process is not working between the
> OSDs for some PGs and we have been unable to unstick it.
>
> Thanks,
>
> On Fri, Jun 2, 2017 at 1:18 PM, Tyanko Aleksiev <tyanko.alex...@gmail.com> wrote:
>>
>> Additionally, it could be useful to know what you did during the
>> maintenance.
>>
>> Cheers,
>> Tyanko
>>
>> On 2 June 2017 at 14:08, Saverio Proto <ziopr...@gmail.com> wrote:
>>>
>>> To give you some help you need to tell us the Ceph version you are
>>> using and, from the [osd] section of ceph.conf, what values you have
>>> for the following:
>>>
>>> [osd]
>>> osd max backfills
>>> osd recovery max active
>>> osd recovery op priority
>>>
>>> These three settings can influence the recovery speed.
>>>
>>> Also, do you have big enough limits?
>>>
>>> Check on any host the content of: /proc/`pid_of_the_osd`/limits
>>>
>>> Saverio
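For reference, one way to check both of those on a live OSD host might look
like the following; osd.0 and the single-pid lookup are only placeholders to
adapt to each host:

    # effective recovery settings of a running OSD, read over its admin socket
    ceph daemon osd.0 config show | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'

    # limits the OSD process is actually running with (pick the right pid on a multi-OSD host)
    cat /proc/$(pidof ceph-osd | awk '{print $1}')/limits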
>>>
>>> 2017-06-02 14:00 GMT+02:00 Grant Morley <grantmorley1...@gmail.com>:
>>> > HEALTH_ERR 210 pgs are stuck inactive for more than 300 seconds;
>>> > 296 pgs backfill_wait; 3 pgs backfilling; 1 pgs degraded; 202 pgs peering;
>>> > 1 pgs recovery_wait; 1 pgs stuck degraded; 210 pgs stuck inactive;
>>> > 510 pgs stuck unclean; 3308 requests are blocked > 32 sec;
>>> > 41 osds have slow requests; recovery 2/11091408 objects degraded (0.000%);
>>> > recovery 1778127/11091408 objects misplaced (16.032%);
>>> > nodown,noout,noscrub,nodeep-scrub flag(s) set
>>> >
>>> > pg 3.235 is stuck inactive for 138232.508429, current state peering, last acting [11,26,1]
>>> > pg 1.237 is stuck inactive for 138260.482588, current state peering, last acting [8,41,34]
>>> > pg 2.231 is stuck inactive for 138258.316031, current state peering, last acting [24,53,8]
>>> > pg 2.22e is stuck inactive for 194033.321591, current state remapped+peering, last acting [0,29,1]
>>> > pg 1.22c is stuck inactive for 102514.200154, current state peering, last acting [51,7,20]
>>> > pg 2.228 is stuck inactive for 138258.317797, current state peering, last acting [53,4,34]
>>> > pg 1.227 is stuck inactive for 138258.244681, current state remapped+peering, last acting [48,35,11]
>>> > pg 2.220 is stuck inactive for 193940.066322, current state remapped+peering, last acting [9,39,8]
>>> > pg 1.222 is stuck inactive for 101474.087688, current state peering, last acting [23,11,35]
>>> > pg 3.130 is stuck inactive for 99735.451290, current state peering, last acting [27,37,17]
>>> > pg 3.136 is stuck inactive for 138221.552865, current state peering, last acting [26,49,10]
>>> > pg 3.13c is stuck inactive for 137563.906503, current state peering, last acting [51,53,7]
>>> > pg 2.142 is stuck inactive for 99962.462932, current state peering, last acting [37,16,34]
>>> > pg 1.141 is stuck inactive for 138257.572476, current state remapped+peering, last acting [5,17,49]
>>> > pg 2.141 is stuck inactive for 102567.745720, current state peering, last acting [36,7,15]
>>> > pg 3.144 is stuck inactive for 138218.289585, current state remapped+peering, last acting [18,28,16]
>>> > pg 1.14d is stuck inactive for 138260.030530, current state peering, last acting [46,43,17]
>>> > pg 3.155 is stuck inactive for 138227.368541, current state remapped+peering, last acting [33,20,52]
>>> > pg 2.8d is stuck inactive for 100251.802576, current state peering, last acting [6,39,27]
>>> > pg 2.15c is stuck inactive for 102567.512279, current state remapped+peering, last acting [7,35,49]
>>> > pg 2.167 is stuck inactive for 138260.093367, current state peering, last acting [35,23,17]
>>> > pg 3.9d is stuck inactive for 117050.294600, current state peering, last acting [12,51,23]
>>> > pg 2.16e is stuck inactive for 99846.214239, current state peering, last acting [25,5,8]
>>> > pg 2.17b is stuck inactive for 99733.504794, current state peering, last acting [49,27,14]
>>> > pg 3.178 is stuck inactive for 99973.600671, current state peering, last acting [29,16,40]
>>> > pg 3.240 is stuck inactive for 28768.488851, current state remapped+peering, last acting [33,8,32]
>>> > pg 3.b6 is stuck inactive for 138222.461160, current state peering, last acting [26,29,34]
>>> > pg 2.17e is stuck inactive for 159229.154401, current state peering, last acting [13,42,48]
>>> > pg 2.17c is stuck inactive for 104921.767401, current state remapped+peering, last acting [23,12,24]
>>> > pg 3.17d is stuck inactive for 137563.979966, current state remapped+peering, last acting [43,24,29]
>>> > pg 1.24b is stuck inactive for 93144.933177, current state peering, last acting [43,20,37]
>>> > pg 1.bd is stuck inactive for 102616.793475, current state peering, last acting [16,30,35]
>>> > pg 3.1d6 is stuck inactive for 99974.485247, current state peering, last acting [16,38,29]
>>> > pg 2.172 is stuck inactive for 193919.627310, current state inactive, last acting [39,21,10]
>>> > pg 1.171 is stuck inactive for 104947.558748, current state peering, last acting [49,9,25]
>>> > pg 1.243 is stuck inactive for 208452.393430, current state peering, last acting [45,32,24]
>>> > pg 3.aa is stuck inactive for 104958.230601, current state remapped+peering, last acting [51,12,13]
>>> >
>>> > 41 osds have slow requests
>>> > recovery 2/11091408 objects degraded (0.000%)
>>> > recovery 1778127/11091408 objects misplaced (16.032%)
>>> > nodown,noout,noscrub,nodeep-scrub flag(s) set
>>> >
>>> > That is what we seem to be getting a lot of. It appears the PGs are just
>>> > stuck as inactive. I am not sure how to get around that.
>>> >
>>> > Thanks,
>>> >
>>> > On Fri, Jun 2, 2017 at 12:55 PM, Saverio Proto <ziopr...@gmail.com> wrote:
>>> >>
>>> >> Usually 'ceph health detail' gives better info on what is making
>>> >> everything stuck.
>>> >>
>>> >> Saverio
>>> >>
>>> >> 2017-06-02 13:51 GMT+02:00 Grant Morley <grantmorley1...@gmail.com>:
>>> >> > Hi All,
>>> >> >
>>> >> > I wonder if anyone could help at all.
>>> >> >
>>> >> > We were doing some routine maintenance on our Ceph cluster and, after
>>> >> > running a "service ceph-all restart" on one of our nodes, we noticed
>>> >> > that something wasn't quite right. The cluster has gone into an error
>>> >> > state, we have multiple stuck PGs, and the object replacement recovery
>>> >> > is taking a strangely long time. At first there were about 46% of
>>> >> > objects misplaced and we now have roughly 16%.
>>> >> >
>>> >> > However, it has taken about 36 hours to do the recovery so far and,
>>> >> > with a possible 16% still to go, we are looking at a fairly major
>>> >> > issue. As a lot of the system is now blocked for reads/writes,
>>> >> > customers cannot access their VMs.
>>> >> >
>>> >> > I think the main issue at the moment is that we have 210 PGs stuck
>>> >> > inactive and nothing we seem to do can get them to peer.
>>> >> >
>>> >> > Below is an output of the ceph status. Can anyone help or have any
>>> >> > ideas on how to speed up the recovery process? We have tried turning
>>> >> > down logging on the OSDs but some are going so slow they won't allow
>>> >> > us to injectargs into them.
>>> >> >
>>> >> >     health HEALTH_ERR
>>> >> >            210 pgs are stuck inactive for more than 300 seconds
>>> >> >            298 pgs backfill_wait
>>> >> >            3 pgs backfilling
>>> >> >            1 pgs degraded
>>> >> >            200 pgs peering
>>> >> >            1 pgs recovery_wait
>>> >> >            1 pgs stuck degraded
>>> >> >            210 pgs stuck inactive
>>> >> >            512 pgs stuck unclean
>>> >> >            3310 requests are blocked > 32 sec
>>> >> >            recovery 2/11094405 objects degraded (0.000%)
>>> >> >            recovery 1785063/11094405 objects misplaced (16.090%)
>>> >> >            nodown,noout,noscrub,nodeep-scrub flag(s) set
>>> >> >     election epoch 16314, quorum 0,1,2,3,4,5,6,7,8
>>> >> >            storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
>>> >> >     osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
>>> >> >            flags nodown,noout,noscrub,nodeep-scrub
>>> >> >     pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
>>> >> >            43356 GB used, 47141 GB / 90498 GB avail
>>> >> >            2/11094405 objects degraded (0.000%)
>>> >> >            1785063/11094405 objects misplaced (16.090%)
>>> >> >                1524 active+clean
>>> >> >                 298 active+remapped+wait_backfill
>>> >> >                 153 peering
>>> >> >                  47 remapped+peering
>>> >> >                  10 inactive
>>> >> >                   3 active+remapped+backfilling
>>> >> >                   1 active+recovery_wait+degraded+remapped
>>> >> >
>>> >> > Many thanks,
>>> >> >
>>> >> > Grant
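On the injectargs attempts mentioned above, the usual form is something like
the commands below; the option names and values are only examples and vary
between Ceph releases. Note that ceph tell still has to reach each OSD daemon,
which would explain why it hangs when the OSDs themselves are unresponsive.

    # quieten OSD logging cluster-wide
    ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'

    # throttle recovery/backfill so client I/O is not starved
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'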