Mine was easier than I thought: it turned out to be a bunch of rados client
connections stuck trying to do bench cleanup on a no-longer-existing pool,
probably endlessly trying to find where the (also no-longer-existing) OSDs
they needed to talk to are.
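
In case it helps anyone else, this is roughly how I tracked them down and
cleared them; the pool name below is just a placeholder, and the mon id is
assumed to match the short hostname:

  # on a mon host, list the sessions the mon is holding and look for stale clients
  ceph daemon mon.$(hostname -s) sessions | grep client.

  # on the client host, find and stop the stuck rados bench/cleanup processes
  pgrep -af 'rados.*(bench|cleanup)'
  pkill -f 'rados -p testpool cleanup'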


On Tue, 17 Dec 2019 at 12:41, Rafael Lopez <rafael.lo...@monash.edu> wrote:

> Hi guys,
>
> I am running Red Hat Ceph (basically Luminous - ceph version
> 12.2.12-48.el7cp (26388d73d88602005946d4381cc5796d42904858)) and am seeing
> something similar on our test cluster.
>
> One of the mons is running at around 300% CPU non-stop. It doesn't seem to
> be the lead mon or any one in particular; the CPU load shifts to another
> mon if the high-load mon is restarted.
> I thought it might be related to this thread since it seems to have
> started happening when removing and adding a lot of OSDs. In fact I have
> removed and added all the OSDs in the cluster several times, and the mons
> have been restarted several times, but the load persists.
>
> At debug_mon 20/5, I see endless lines like this, which seem to have to do
> with the osdmap:
>
> 2019-12-17 11:59:47.916098 7f27dfba1700 10 mon.mon1@1(peon) e4 handle_get_version mon_get_version(what=osdmap handle=2874836684) v1
> 2019-12-17 11:59:47.916139 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch existing session 0x55ab61fb6300 for client.27824428 10.0.0.2:0/461841538
> 2019-12-17 11:59:47.916146 7f27dfba1700 20 mon.mon1@1(peon) e4  caps allow *
> 2019-12-17 11:59:47.916149 7f27dfba1700 20 is_capable service=mon command= read on cap allow *
> 2019-12-17 11:59:47.916151 7f27dfba1700 20  allow so far , doing grant allow *
> 2019-12-17 11:59:47.916152 7f27dfba1700 20  allow all
> 2019-12-17 11:59:47.916153 7f27dfba1700 10 mon.mon1@1(peon) e4 handle_get_version mon_get_version(what=osdmap handle=2871621985) v1
> 2019-12-17 11:59:47.916203 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch existing session 0x55ab61d7c780 for client.27824430 10.0.0.2:0/898487246
> 2019-12-17 11:59:47.916210 7f27dfba1700 20 mon.mon1@1(peon) e4  caps allow *
> 2019-12-17 11:59:47.916213 7f27dfba1700 20 is_capable service=mon command= read on cap allow *
> 2019-12-17 11:59:47.916215 7f27dfba1700 20  allow so far , doing grant allow *
> 2019-12-17 11:59:47.916216 7f27dfba1700 20  allow all
> 2019-12-17 11:59:47.916217 7f27dfba1700 10 mon.mon1@1(peon) e4 handle_get_version mon_get_version(what=osdmap handle=2882637609) v1
> 2019-12-17 11:59:47.916254 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch existing session 0x55ab62649c80 for client.27824431 10.0.0.2:0/972633098
> 2019-12-17 11:59:47.916262 7f27dfba1700 20 mon.mon1@1(peon) e4  caps allow *
> 2019-12-17 11:59:47.916266 7f27dfba1700 20 is_capable service=mon command= read on cap allow *
> 2019-12-17 11:59:47.916268 7f27dfba1700 20  allow so far , doing grant allow *
> 2019-12-17 11:59:47.916269 7f27dfba1700 20  allow all
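>
> In case it's useful, this is roughly how I bumped the debug level and tied
> those handle_get_version requests back to sessions (mon1 is just our mon's id):
>
>   # raise mon debugging via the admin socket on the busy mon
>   ceph daemon mon.mon1 config set debug_mon 20/5
>   # count the osdmap version requests and see which client owns a session
>   grep -c handle_get_version /var/log/ceph/ceph-mon.mon1.log
>   ceph daemon mon.mon1 sessions | grep client.27824428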
>
> Continuing to investigate.
>
> Raf
>
> On Tue, 17 Dec 2019 at 11:53, Sasha Litvak <alexander.v.lit...@gmail.com>
> wrote:
>
>> Bryan, thank you.  We are about to start testing the 14.2.2 -> 14.2.5
>> upgrade, so folks here are a bit cautious :-)  We don't need to convert, but
>> we may have to rebuild a few disks after the upgrade.
>>
>> On Mon, Dec 16, 2019 at 3:57 PM Bryan Stillwell <bstillw...@godaddy.com>
>> wrote:
>>
>>> Sasha,
>>>
>>> I was able to get past it by restarting the ceph-mon processes every
>>> time it got stuck, but that's not a very good solution for a production
>>> cluster.
>>>
>>> Right now I'm trying to narrow down what is causing the problem.
>>> Rebuilding the OSDs with BlueStore doesn't seem to be enough.  I believe it
>>> could be related to us using the extra space on the journal device as an
>>> SSD-based OSD.  During the conversion process I'm removing this SSD-based
>>> OSD (since with BlueStore the omap data ends up on the SSD anyway), and I
>>> suspect that removal might be causing this problem.
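>>>
>>> For what it's worth, removing that SSD-based OSD is roughly the standard
>>> out/purge/zap sequence (the OSD id and device below are just examples):
>>>
>>>   ceph osd out 42
>>>   # wait for the data to drain off, then:
>>>   systemctl stop ceph-osd@42
>>>   ceph osd purge 42 --yes-i-really-mean-it
>>>   ceph-volume lvm zap --destroy /dev/sdX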
>>>
>>> Bryan
>>>
>>> On Dec 14, 2019, at 10:27 AM, Sasha Litvak <alexander.v.lit...@gmail.com>
>>> wrote:
>>>
>>> Bryan,
>>>
>>> Were you able to resolve this?  If yes, can you please share with the
>>> list?
>>>
>>> On Fri, Dec 13, 2019 at 10:08 AM Bryan Stillwell <bstillw...@godaddy.com>
>>> wrote:
>>>
>>>> Adding the dev list since it seems like a bug in 14.2.5.
>>>>
>>>> I was able to capture the output from perf top:
>>>>
>>>>   21.58%  libceph-common.so.0               [.] ceph::buffer::v14_2_0::list::append
>>>>   20.90%  libstdc++.so.6.0.19               [.] std::getline<char, std::char_traits<char>, std::allocator<char> >
>>>>   13.25%  libceph-common.so.0               [.] ceph::buffer::v14_2_0::list::append
>>>>   10.11%  libstdc++.so.6.0.19               [.] std::istream::sentry::sentry
>>>>    8.94%  libstdc++.so.6.0.19               [.] std::basic_ios<char, std::char_traits<char> >::clear
>>>>    3.24%  libceph-common.so.0               [.] ceph::buffer::v14_2_0::ptr::unused_tail_length
>>>>    1.69%  libceph-common.so.0               [.] std::getline<char, std::char_traits<char>, std::allocator<char> >@plt
>>>>    1.63%  libstdc++.so.6.0.19               [.] std::istream::sentry::sentry@plt
>>>>    1.21%  [kernel]                          [k] __do_softirq
>>>>    0.77%  libpython2.7.so.1.0               [.] PyEval_EvalFrameEx
>>>>    0.55%  [kernel]                          [k] _raw_spin_unlock_irqrestore
>>>>
>>>> I increased mon debugging to 20 and nothing stuck out to me.
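>>>>
>>>> For anyone wanting to reproduce, something like the following should do it
>>>> (attach perf to whichever ceph-mon is pegged):
>>>>
>>>>   perf top -p $(pgrep -x ceph-mon)
>>>>   ceph tell mon.\* injectargs '--debug_mon 20/20'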
>>>>
>>>> Bryan
>>>>
>>>> > On Dec 12, 2019, at 4:46 PM, Bryan Stillwell <bstillw...@godaddy.com>
>>>> wrote:
>>>> >
>>>> > On our test cluster after upgrading to 14.2.5 I'm having problems
>>>> with the mons pegging a CPU core while moving data around.  I'm currently
>>>> converting the OSDs from FileStore to BlueStore by marking the OSDs out in
>>>> multiple nodes, destroying the OSDs, and then recreating them with
>>>> ceph-volume lvm batch.  This seems to get the ceph-mon process into a
>>>> state where it pegs a CPU core on one of the mons:
>>>> >
>>>> > 1764450 ceph      20   0 4802412   2.1g  16980 S 100.0 28.1   4:54.72 ceph-mon
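>>>> >
>>>> > For reference, the per-node conversion loop is roughly the following
>>>> > (OSD ids and devices are examples):
>>>> >
>>>> >   ceph osd out 10 11 12
>>>> >   # once the data has moved off and they are safe to remove:
>>>> >   ceph osd safe-to-destroy 10 11 12
>>>> >   systemctl stop ceph-osd@10 ceph-osd@11 ceph-osd@12
>>>> >   for id in 10 11 12; do ceph osd destroy $id --yes-i-really-mean-it; done
>>>> >   ceph-volume lvm batch --bluestore /dev/sdb /dev/sdc /dev/sdd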
>>>> >
>>>> > Has anyone else run into this with 14.2.5 yet?  I didn't see this
>>>> problem while the cluster was running 14.2.4.
>>>> >
>>>> > Thanks,
>>>> > Bryan
>>>
>>>
>>
>
>
> --
> *Rafael Lopez*
> Research Devops Engineer
> Monash University eResearch Centre
>
> E: rafael.lo...@monash.edu
>
>

-- 
*Rafael Lopez*
Research Devops Engineer
Monash University eResearch Centre

T: +61 3 9905 9118
M: +61 (0)427682670
E: rafael.lo...@monash.edu
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
