Hi folks,

I have just made a tracker for this issue: http://tracker.ceph.com/issues/18960

I used ceph-post-file to upload some logs from the primary OSD for the troubled PG.
Any help would be appreciated. If we can't get it to peer, we'd like to at least get it unstuck, even if that means data loss. What's the proper way to go about doing that?

Best regards,

George

________________________________________
From: ceph-users [[email protected]] on behalf of [email protected] [[email protected]]
Sent: 14 February 2017 10:27
To: [email protected]; [email protected]
Subject: Re: [ceph-users] PG stuck peering after host reboot

Hi Brad,

I'll be doing so later in the day.

Thanks,

George

________________________________________
From: Brad Hubbard [[email protected]]
Sent: 13 February 2017 22:03
To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
Subject: Re: [ceph-users] PG stuck peering after host reboot

I'd suggest creating a tracker and uploading a full debug log from the primary so we can look at this in more detail.

On Mon, Feb 13, 2017 at 9:11 PM, <[email protected]> wrote:
> Hi Brad,
>
> I could not tell you that, as `ceph pg 1.323 query` never completes; it just
> hangs there.
>
> On 11/02/2017, 00:40, "Brad Hubbard" <[email protected]> wrote:
>
> On Thu, Feb 9, 2017 at 3:36 AM, <[email protected]> wrote:
> > Hi Corentin,
> >
> > I've tried that; the primary hangs when trying to injectargs, so I set
> > the option in the config file and restarted all OSDs in the PG. It came up
> > with:
> >
> > pg 1.323 is remapped+peering, acting
> > [595,1391,2147483647,127,937,362,267,320,7,634,716]
> >
> > Still can't query the PG, and there are no error messages in the logs of osd.240.
> > The logs on osd.595 and osd.7 still fill up with the same messages.
>
> So what does "peering_blocked_by_detail" show in that case, since it
> can no longer show "peering_blocked_by_history_les_bound"?
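[A side note on the acting set quoted above: the 2147483647 entry is not a real OSD id. It is CRUSH_ITEM_NONE, the sentinel value CRUSH uses when it cannot map any OSD to a shard of an EC PG, and it is simply the largest signed 32-bit integer. A quick illustration, using the acting set from the thread:]

```python
# 2147483647 in an acting set is CRUSH_ITEM_NONE: the sentinel CRUSH uses
# for "no OSD mapped to this shard" (0x7fffffff, the max signed 32-bit int).
CRUSH_ITEM_NONE = 0x7FFFFFFF

acting = [595, 1391, 2147483647, 127, 937, 362, 267, 320, 7, 634, 716]

# Find which shard slots of the EC PG currently have no OSD assigned.
missing_shards = [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]

print(missing_shards)  # -> [2], i.e. shard 2 has no OSD mapped
```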
> >
> > Regards,
> >
> > George
> > ________________________________
> > From: Corentin Bonneton [[email protected]]
> > Sent: 08 February 2017 16:31
> > To: Vasilakakos, George (STFC,RAL,SC)
> > Cc: [email protected]
> > Subject: Re: [ceph-users] PG stuck peering after host reboot
> >
> > Hello,
> >
> > I have had this case before; I applied the parameter
> > (osd_find_best_info_ignore_history_les) to all the OSDs that had reported
> > blocked queries.
> >
> > --
> > Regards,
> > CEO FEELB | Corentin BONNETON
> > [email protected]
> >
> > On 8 Feb 2017, at 17:17, [email protected] wrote:
> >
> > Hi Ceph folks,
> >
> > I have a cluster running Jewel 10.2.5 using a mix of EC and replicated
> > pools.
> >
> > After rebooting a host last night, one PG refuses to complete peering:
> >
> > pg 1.323 is stuck inactive for 73352.498493, current state peering,
> > last acting [595,1391,240,127,937,362,267,320,7,634,716]
> >
> > Restarting OSDs or hosts does nothing to help, or sometimes results in
> > things like this:
> >
> > pg 1.323 is remapped+peering, acting
> > [2147483647,1391,240,127,937,362,267,320,7,634,716]
> >
> > The host that was rebooted is home to osd.7 (rank 8 in the PG). If I go
> > onto it to look at the logs for osd.7, this is what I see:
> >
> > $ tail -f /var/log/ceph/ceph-osd.7.log
> > 2017-02-08 15:41:00.445247 7f5fcc2bd700 0 --
> > XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400
> > sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating
> > reconnect
> >
> > I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates
> > the direction of communication. I've traced these to osd.7 (rank 8 in the
> > stuck PG) reaching out to osd.595 (the primary in the stuck PG).
> > Meanwhile, looking at the logs of osd.595, I see this:
> >
> > $ tail -f /var/log/ceph/ceph-osd.595.log
> > 2017-02-08 15:41:15.760708 7f1765673700 0 --
> > XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400
> > sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs
> > existing 477 state standby
> > 2017-02-08 15:41:20.768844 7f1765673700 0 bad crc in front 1941070384
> > != exp 3786596716
> >
> > which again shows osd.595 reaching out to osd.7; from what I could
> > gather, the CRC problem is at the messaging layer.
> >
> > Google searching has yielded nothing particularly useful on how to get
> > this unstuck.
> >
> > ceph pg 1.323 query seems to hang forever, but it completed once last
> > night and I noticed this:
> >
> > "peering_blocked_by_detail": [
> >     {
> >         "detail": "peering_blocked_by_history_les_bound"
> >     }
> > ]
> >
> > We have seen this before, and it was cleared by setting
> > osd_find_best_info_ignore_history_les to true for the first two OSDs on the
> > stuck PGs (that was on a 3-replica pool). This hasn't worked in this case, and
> > I suspect the option needs to be set on either a majority of OSDs, or on at
> > least k OSDs, so that their data can be used and the history ignored.
> >
> > We would really appreciate any guidance and/or help the community can
> > offer!
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Cheers,
> Brad

--
Cheers,
Brad

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
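[For anyone scripting around this state: when `ceph pg <pgid> query` does complete, it emits JSON, and the blocking reason can be pulled out programmatically. A minimal sketch, assuming the `recovery_state` / `peering_blocked_by_detail` field names shown in the quoted output; the surrounding JSON structure here is trimmed and illustrative, not verbatim query output:]

```python
import json

# Trimmed, illustrative excerpt of `ceph pg 1.323 query` output; only the
# field names quoted in the thread are taken as given.
query_output = """
{
  "recovery_state": [
    {
      "name": "Started/Primary/Peering",
      "peering_blocked_by_detail": [
        {"detail": "peering_blocked_by_history_les_bound"}
      ]
    }
  ]
}
"""

state = json.loads(query_output)

# Collect every blocking reason reported anywhere in recovery_state.
blockers = [
    entry["detail"]
    for stage in state.get("recovery_state", [])
    for entry in stage.get("peering_blocked_by_detail", [])
]

print(blockers)  # -> ['peering_blocked_by_history_les_bound']
```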
