I'd suggest creating a tracker and uploading a full debug log from the primary so we can look at this in more detail.
On Mon, Feb 13, 2017 at 9:11 PM, <george.vasilaka...@stfc.ac.uk> wrote:
> Hi Brad,
>
> I could not tell you that as `ceph pg 1.323 query` never completes, it just
> hangs there.
>
> On 11/02/2017, 00:40, "Brad Hubbard" <bhubb...@redhat.com> wrote:
>
> > On Thu, Feb 9, 2017 at 3:36 AM, <george.vasilaka...@stfc.ac.uk> wrote:
> > > Hi Corentin,
> > >
> > > I've tried that, the primary hangs when trying to injectargs so I set
> > > the option in the config file and restarted all OSDs in the PG, it came
> > > up with:
> > >
> > > pg 1.323 is remapped+peering, acting [595,1391,2147483647,127,937,362,267,320,7,634,716]
> > >
> > > Still can't query the PG, no error messages in the logs of osd.240.
> > > The logs on osd.595 and osd.7 still fill up with the same messages.
> >
> > So what does "peering_blocked_by_detail" show in that case since it
> > can no longer show "peering_blocked_by_history_les_bound"?
> >
> > > Regards,
> > >
> > > George
> > > ________________________________
> > > From: Corentin Bonneton [l...@titin.fr]
> > > Sent: 08 February 2017 16:31
> > > To: Vasilakakos, George (STFC,RAL,SC)
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] PG stuck peering after host reboot
> > >
> > > Hello,
> > >
> > > I have already had this case. I applied the parameter
> > > (osd_find_best_info_ignore_history_les) to all the OSDs that had
> > > reported the queries as blocked.
> > >
> > > --
> > > Regards,
> > > CEO FEELB | Corentin BONNETON
> > > cont...@feelb.io
> > >
> > > On 8 Feb 2017, at 17:17, george.vasilaka...@stfc.ac.uk wrote:
> > >
> > > Hi Ceph folks,
> > >
> > > I have a cluster running Jewel 10.2.5 using a mix of EC and replicated
> > > pools.
> > >
> > > After rebooting a host last night, one PG refuses to complete peering
> > >
> > > pg 1.323 is stuck inactive for 73352.498493, current state peering, last acting [595,1391,240,127,937,362,267,320,7,634,716]
> > >
> > > Restarting OSDs or hosts does nothing to help, or sometimes results in
> > > things like this:
> > >
> > > pg 1.323 is remapped+peering, acting [2147483647,1391,240,127,937,362,267,320,7,634,716]
> > >
> > >
> > > The host that was rebooted is home to osd.7 (8). If I go onto it to
> > > look at the logs for osd.7 this is what I see:
> > >
> > > $ tail -f /var/log/ceph/ceph-osd.7.log
> > > 2017-02-08 15:41:00.445247 7f5fcc2bd700 0 -- XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating reconnect
> > >
> > > I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates
> > > the direction of communication. I've traced these to osd.7 (rank 8 in the
> > > stuck PG) reaching out to osd.595 (the primary in the stuck PG).
> > >
> > > Meanwhile, looking at the logs of osd.595 I see this:
> > >
> > > $ tail -f /var/log/ceph/ceph-osd.595.log
> > > 2017-02-08 15:41:15.760708 7f1765673700 0 -- XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs existing 477 state standby
> > > 2017-02-08 15:41:20.768844 7f1765673700 0 bad crc in front 1941070384 != exp 3786596716
> > >
> > > which again shows osd.595 reaching out to osd.7 and from what I could
> > > gather the CRC problem is about messaging.
> > >
> > > Google searching has yielded nothing particularly useful on how to get
> > > this unstuck.
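A note on reading the acting sets quoted above: the value 2147483647 is not a real OSD id. It is 0x7fffffff, the placeholder Ceph prints when no OSD is currently mapped to that position, and the "rank" George mentions is simply the 0-based position in the bracketed list. The following is only a minimal sketch for picking those two things out of a health line. The helper names (parse_acting, unmapped_ranks, rank_of) are hypothetical and are not part of any Ceph tool; the input strings are the lines from this thread.

```python
import re

# 2147483647 (0x7fffffff) is the placeholder printed when no OSD is mapped
# to a position; the constant name here is chosen for this sketch.
CRUSH_ITEM_NONE = 2147483647


def parse_acting(line):
    """Return the bracketed OSD list from a health line as a list of ints."""
    match = re.search(r"\[([0-9,\s]+)\]", line)
    if match is None:
        raise ValueError("no bracketed acting set in line")
    return [int(tok) for tok in match.group(1).split(",")]


def unmapped_ranks(acting):
    """Ranks (0-based positions) that currently have no OSD assigned."""
    return [rank for rank, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]


def rank_of(acting, osd_id):
    """0-based position of an OSD in the acting set."""
    return acting.index(osd_id)


stuck = parse_acting(
    "pg 1.323 is stuck inactive for 73352.498493, current state peering, "
    "last acting [595,1391,240,127,937,362,267,320,7,634,716]"
)
remapped = parse_acting(
    "pg 1.323 is remapped+peering, acting "
    "[2147483647,1391,240,127,937,362,267,320,7,634,716]"
)

print(rank_of(stuck, 7))          # 8, matching "osd.7 (rank 8 in the stuck PG)"
print(unmapped_ranks(remapped))   # [0], the position normally held by osd.595
```

Read this way, the remapped line in the original report shows the primary's position (rank 0, osd.595) as unmapped, while the later one after the config change shows rank 2 (osd.240) unmapped.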
> > >
> > > ceph pg 1.323 query seems to hang forever but it completed once last
> > > night and I noticed this:
> > >
> > > "peering_blocked_by_detail": [
> > >     {
> > >         "detail": "peering_blocked_by_history_les_bound"
> > >     }
> > >
> > > We have seen this before and it was cleared by setting
> > > osd_find_best_info_ignore_history_les to true for the first two OSDs on
> > > the stuck PGs (this was on a 3 replica pool). This hasn't worked in this
> > > case and I suspect the option needs to be set on either a majority of
> > > OSDs or at least k OSDs to be able to use their data and ignore history.
> > >
> > > We would really appreciate any guidance and/or help the community can
> > > offer!
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > --
> > Cheers,
> > Brad
>
>

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com