We are also running into this issue on one of our clusters (balancer
mode upmap, about 950 OSDs).
Andras
On 12/18/19 4:44 PM, Bryan Stillwell wrote:
On Dec 18, 2019, at 11:58 AM, Sage Weil <s...@newdream.net> wrote:
On Wed, 18 Dec 2019, Bryan Stillwell wrote:
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus
14.2.5, I'm seeing 100% CPU usage by a single ceph-mgr thread (found
using 'top -H'). Attaching to the thread with strace shows a lot of
mmap and munmap calls. Here's the distribution of system calls after
watching it for a few minutes:
48.73% - mmap
49.48% - munmap
1.75% - futex
0.05% - madvise
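For anyone who wants to produce a similar breakdown, here is a minimal sketch. It assumes the percentages above are shares of the number of calls observed (not time spent) and that the strace output was saved without per-line pid prefixes. The function name syscall_distribution, the <TID> placeholder, and the /tmp/mgr.strace path are illustrative and not from the original report.

```shell
# Hypothetical helper: read raw strace lines on stdin and print each
# syscall's share of the total number of calls, highest first.
# Lines that do not start with "name(" (signals, exit notices) are skipped.
syscall_distribution() {
  awk -F'[(]' '
    /^[a-z_0-9]+\(/ { count[$1]++; total++ }
    END {
      for (name in count)
        printf "%.2f%% - %s\n", 100 * count[name] / total, name
    }
  ' | sort -rn
}

# Possible usage (assumed workflow, <TID> is the busy thread id from top -H):
#   timeout 120 strace -p <TID> -o /tmp/mgr.strace
#   syscall_distribution < /tmp/mgr.strace
```

If time spent per call matters more than call counts, strace's own -c summary mode may be a better fit than counting lines.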
I've upgraded 3 other clusters so far (120 OSDs, 30 OSDs, 200 OSDs),
but the affected cluster (355 OSDs) is the only one that has shown the
problem. Perhaps it has something to do with its size?
I suspected that one of the mgr modules might be misbehaving, so I
disabled all of them:
# ceph mgr module ls | jq -r '.enabled_modules'
[]
But that didn't help (I restarted the mgrs after disabling the
modules too).
I also tried setting debug_mgr and debug_mgrc to 20, but nothing
popped out at me as being the cause of the problem.
It only seems to affect the active mgr. If I stop the active mgr
the problem moves to one of the other mgrs.
Any guesses or tips on what next steps I should take to figure out
what's going on?
What are the balancer modes on the affected and unaffected cluster(s)?
Affected cluster has a balancer mode of "none".
The other three are "upmap", "none", and "upmap".
I don't know if you saw in ceph-users, but this bug report seems to
point at the finisher-Mgr thread:
https://tracker.ceph.com/issues/43364
Thanks,
Bryan
_______________________________________________
Dev mailing list -- d...@ceph.io
To unsubscribe send an email to dev-le...@ceph.io