Mine was easier than I thought. It turns out it was a bunch of rados client connections stuck trying to do bench cleanup on a no-longer-existing pool, probably endlessly trying to find out where the (also no-longer-existing) OSDs they need to talk to are.
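In case it helps anyone else chasing the same symptom, something along these lines should surface the culprits (illustrative only, not a transcript of what I ran; the mon id and pool name are placeholders):

    # On the busy mon, list client sessions via the admin socket and look for
    # addresses that keep turning up in the debug log:
    ceph daemon mon.$(hostname -s) sessions | grep client

    # On the hosts those addresses point back to, look for leftover rados
    # processes still trying to clean up bench objects, i.e. something that
    # was originally started along the lines of:
    #   rados -p <pool-that-no-longer-exists> cleanup
    ps aux | grep '[r]ados'

Once those stale clients are killed off, the mon load should drop back to normal.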
On Tue, 17 Dec 2019 at 12:41, Rafael Lopez <rafael.lo...@monash.edu> wrote:

> Hi guys,
>
> I am running Red Hat Ceph (basically Luminous - ceph version 12.2.12-48.el7cp (26388d73d88602005946d4381cc5796d42904858)) and am seeing something similar on our test cluster.
>
> One of the mons is running at around 300% CPU non-stop. It doesn't seem to be the lead mon or one in particular; the CPU load shifts to another mon if the high-load mon is restarted.
> I thought it might be related to this thread since it seems to have started happening when removing and adding a lot of OSDs. In fact I have removed and added all the OSDs in the cluster several times, and the mons have been restarted several times, but the load persists.
>
> At debug_mon 20/5, I see endless lines like this, which seem to be to do with the osdmap:
>
> 2019-12-17 11:59:47.916098 7f27dfba1700 10 mon.mon1@1(peon) e4 handle_get_version mon_get_version(what=osdmap handle=2874836684) v1
> 2019-12-17 11:59:47.916139 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch existing session 0x55ab61fb6300 for client.27824428 10.0.0.2:0/461841538
> 2019-12-17 11:59:47.916146 7f27dfba1700 20 mon.mon1@1(peon) e4 caps allow *
> 2019-12-17 11:59:47.916149 7f27dfba1700 20 is_capable service=mon command= read on cap allow *
> 2019-12-17 11:59:47.916151 7f27dfba1700 20 allow so far , doing grant allow *
> 2019-12-17 11:59:47.916152 7f27dfba1700 20 allow all
> 2019-12-17 11:59:47.916153 7f27dfba1700 10 mon.mon1@1(peon) e4 handle_get_version mon_get_version(what=osdmap handle=2871621985) v1
> 2019-12-17 11:59:47.916203 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch existing session 0x55ab61d7c780 for client.27824430 10.0.0.2:0/898487246
> 2019-12-17 11:59:47.916210 7f27dfba1700 20 mon.mon1@1(peon) e4 caps allow *
> 2019-12-17 11:59:47.916213 7f27dfba1700 20 is_capable service=mon command= read on cap allow *
> 2019-12-17 11:59:47.916215 7f27dfba1700 20 allow so far , doing grant allow *
> 2019-12-17 11:59:47.916216 7f27dfba1700 20 allow all
> 2019-12-17 11:59:47.916217 7f27dfba1700 10 mon.mon1@1(peon) e4 handle_get_version mon_get_version(what=osdmap handle=2882637609) v1
> 2019-12-17 11:59:47.916254 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch existing session 0x55ab62649c80 for client.27824431 10.0.0.2:0/972633098
> 2019-12-17 11:59:47.916262 7f27dfba1700 20 mon.mon1@1(peon) e4 caps allow *
> 2019-12-17 11:59:47.916266 7f27dfba1700 20 is_capable service=mon command= read on cap allow *
> 2019-12-17 11:59:47.916268 7f27dfba1700 20 allow so far , doing grant allow *
> 2019-12-17 11:59:47.916269 7f27dfba1700 20 allow all
>
> Continuing to investigate.
>
> Raf
>
> On Tue, 17 Dec 2019 at 11:53, Sasha Litvak <alexander.v.lit...@gmail.com> wrote:
>
>> Bryan, thank you. We are about to start testing the 14.2.2 -> 14.2.5 upgrade, so folks here are a bit cautious :-) We don't need to convert, but we may have to rebuild a few disks after the upgrade.
>>
>> On Mon, Dec 16, 2019 at 3:57 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:
>>
>>> Sasha,
>>>
>>> I was able to get past it by restarting the ceph-mon processes every time it got stuck, but that's not a very good solution for a production cluster.
>>>
>>> Right now I'm trying to narrow down what is causing the problem. Rebuilding the OSDs with BlueStore doesn't seem to be enough. I believe it could be related to us using the extra space on the journal device as an SSD-based OSD. During the conversion process I'm removing this SSD-based OSD (since with BlueStore the omap data ends up on the SSD anyway), and I'm suspecting it might be causing this problem.
>>>
>>> Bryan
>>>
>>> On Dec 14, 2019, at 10:27 AM, Sasha Litvak <alexander.v.lit...@gmail.com> wrote:
>>>
>>> Bryan,
>>>
>>> Were you able to resolve this? If yes, can you please share with the list?
>>>
>>> On Fri, Dec 13, 2019 at 10:08 AM Bryan Stillwell <bstillw...@godaddy.com> wrote:
>>>
>>>> Adding the dev list since it seems like a bug in 14.2.5.
>>>>
>>>> I was able to capture the output from perf top:
>>>>
>>>>   21.58%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::list::append
>>>>   20.90%  libstdc++.so.6.0.19   [.] std::getline<char, std::char_traits<char>, std::allocator<char> >
>>>>   13.25%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::list::append
>>>>   10.11%  libstdc++.so.6.0.19   [.] std::istream::sentry::sentry
>>>>    8.94%  libstdc++.so.6.0.19   [.] std::basic_ios<char, std::char_traits<char> >::clear
>>>>    3.24%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::ptr::unused_tail_length
>>>>    1.69%  libceph-common.so.0   [.] std::getline<char, std::char_traits<char>, std::allocator<char> >@plt
>>>>    1.63%  libstdc++.so.6.0.19   [.] std::istream::sentry::sentry@plt
>>>>    1.21%  [kernel]              [k] __do_softirq
>>>>    0.77%  libpython2.7.so.1.0   [.] PyEval_EvalFrameEx
>>>>    0.55%  [kernel]              [k] _raw_spin_unlock_irqrestore
>>>>
>>>> I increased mon debugging to 20 and nothing stuck out to me.
>>>>
>>>> Bryan
>>>>
>>>> > On Dec 12, 2019, at 4:46 PM, Bryan Stillwell <bstillw...@godaddy.com> wrote:
>>>> >
>>>> > On our test cluster after upgrading to 14.2.5 I'm having problems with the mons pegging a CPU core while moving data around. I'm currently converting the OSDs from FileStore to BlueStore by marking the OSDs out in multiple nodes, destroying the OSDs, and then recreating them with ceph-volume lvm batch. This seems to get the ceph-mon process into a state where it pegs a CPU core on one of the mons:
>>>> >
>>>> > 1764450 ceph      20   0 4802412   2.1g  16980 S 100.0  28.1   4:54.72 ceph-mon
>>>> >
>>>> > Has anyone else run into this with 14.2.5 yet? I didn't see this problem while the cluster was running 14.2.4.
>>>> >
>>>> > Thanks,
>>>> > Bryan
>
> --
> Rafael Lopez
> Research Devops Engineer
> Monash University eResearch Centre
> E: rafael.lo...@monash.edu

--
Rafael Lopez
Research Devops Engineer
Monash University eResearch Centre
T: +61 3 9905 9118
M: +61 (0)427682670
E: rafael.lo...@monash.edu
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io