Re: [ceph-users] PG stuck peering after host reboot

Corentin Bonneton Wed, 08 Feb 2017 08:31:58 -0800

Hello,

I have run into this before. I applied the parameter 
(osd_find_best_info_ignore_history_les) to all the OSDs that reported 
blocked queries.
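
A minimal sketch of that step, in case it helps: it pulls the OSD ids out of a health line such as the ones quoted below and prints one `ceph tell ... injectargs` command per OSD. The helper names (acting_osds, injectargs_commands) are made up for illustration. The value 2147483647 (0x7fffffff) is the placeholder printed when no OSD is mapped to a slot, so it is skipped. Whether injecting the option at runtime is enough, or whether it has to go into ceph.conf followed by an OSD restart, is an assumption you should check on your version.

```python
import re

# 2147483647 (0x7fffffff) is the placeholder shown when no OSD is mapped to a slot.
NO_OSD = 2147483647


def acting_osds(health_line):
    """Return the OSD ids from the [...] list in a health line, skipping empty slots."""
    match = re.search(r"\[([\d,\s]+)\]", health_line)
    if match is None:
        return []
    ids = [int(tok) for tok in match.group(1).split(",") if tok.strip()]
    return [osd for osd in ids if osd != NO_OSD]


def injectargs_commands(osds, option="osd_find_best_info_ignore_history_les", value="true"):
    """Build (but do not run) one 'ceph tell' command string per OSD."""
    return [f"ceph tell osd.{osd} injectargs '--{option}={value}'" for osd in osds]


line = ("pg 1.323 is remapped+peering, acting "
        "[2147483647,1391,240,127,937,362,267,320,7,634,716]")

for cmd in injectargs_commands(acting_osds(line)):
    print(cmd)
```

The script only prints the commands so they can be reviewed before anything is run; the same list can be regenerated with value="false" to revert the option once the PG has peered.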


--
Regards,
CEO FEELB | Corentin BONNETON
cont...@feelb.io

> On 8 Feb 2017, at 17:17, george.vasilaka...@stfc.ac.uk wrote:
> 
> Hi Ceph folks,
> 
> I have a cluster running Jewel 10.2.5 using a mix of EC and replicated pools.
> 
> After rebooting a host last night, one PG refuses to complete peering:
> 
> pg 1.323 is stuck inactive for 73352.498493, current state peering, last 
> acting [595,1391,240,127,937,362,267,320,7,634,716]
> 
> Restarting OSDs or hosts does nothing to help, or sometimes results in things 
> like this:
> 
> pg 1.323 is remapped+peering, acting 
> [2147483647,1391,240,127,937,362,267,320,7,634,716]
> 
> 
> The host that was rebooted is home to osd.7 (rank 8 in the acting set). If I 
> go onto it to look at the logs for osd.7, this is what I see:
> 
> $ tail -f /var/log/ceph/ceph-osd.7.log
> 2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- XXX.XXX.XXX.172:6905/20510 >> 
> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 
> cs=471 l=0 c=0x7f6070086700).fault, initiating reconnect
> 
> I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates the 
> direction of communication. I've traced these to osd.7 (rank 8 in the stuck 
> PG) reaching out to osd.595 (the primary in the stuck PG).
> 
> Meanwhile, looking at the logs of osd.595 I see this:
> 
> $ tail -f /var/log/ceph/ceph-osd.595.log
> 2017-02-08 15:41:15.760708 7f1765673700  0 -- XXX.XXX.XXX.192:6921/55371 >> 
> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 sd=101 :6921 s=0 pgs=0 cs=0 
> l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs existing 477 state standby
> 2017-02-08 15:41:20.768844 7f1765673700  0 bad crc in front 1941070384 != exp 
> 3786596716
> 
> which again shows osd.595 reaching out to osd.7 and from what I could gather 
> the CRC problem is about messaging.
> 
> Google searching has yielded nothing particularly useful on how to get this 
> unstuck.
> 
> ceph pg 1.323 query seems to hang forever but it completed once last night 
> and I noticed this:
> 
>            "peering_blocked_by_detail": [
>                {
>                    "detail": "peering_blocked_by_history_les_bound"
>                }
> 
> We have seen this before and it was cleared by setting 
> osd_find_best_info_ignore_history_les to true for the first two OSDs on the 
> stuck PGs (this was on a 3 replica pool). This hasn't worked in this case and 
> I suspect the option needs to be set on either a majority of the OSDs or on 
> at least k of them, so that their data can be used and the history ignored.
> 
> We would really appreciate any guidance and/or help the community can offer!
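
On the question of how to read the messenger lines quoted above, here is a small sketch that splits such a line into its parts. It rests on my reading of the log prefix, which is an assumption: the address before ">>" is the local endpoint of the daemon writing the log, the address after it is the peer, and the number after the slash is a nonce (usually the PID). On that reading, ">>" alone does not say who initiated the connection; the trailing event ("fault, initiating reconnect" versus "accept ...") does. The function name parse_pipe_line is made up for illustration.

```python
import re

# Matches "-- <local> >> <peer> pipe(...).<event>" in a simple-messenger log line.
PIPE_RE = re.compile(r"-- (?P<local>\S+) >> (?P<peer>\S+) pipe\(.*?\)\.(?P<event>.*)$")


def split_addr(addr):
    """Split 'host:port/nonce' into its three parts."""
    hostport, _, nonce = addr.partition("/")
    host, _, port = hostport.rpartition(":")
    return {"host": host, "port": int(port), "nonce": int(nonce)}


def parse_pipe_line(line):
    """Return local endpoint, peer endpoint and event text, or None if no match."""
    match = PIPE_RE.search(line)
    if match is None:
        return None
    return {
        "local": split_addr(match.group("local")),
        "peer": split_addr(match.group("peer")),
        "event": match.group("event").strip(),
    }


# The osd.7 line from the message above, joined back onto one line.
sample = ("2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- XXX.XXX.XXX.172:6905/20510 >> "
          "XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 "
          "cs=471 l=0 c=0x7f6070086700).fault, initiating reconnect")

parsed = parse_pipe_line(sample)
print(parsed["local"], parsed["peer"], parsed["event"])
```

Matching the port and nonce pairs against the running OSD processes is then enough to confirm which two daemons a given line refers to.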

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
