Thanks Eugen. I created this bug report to track the issue in case you want to watch it:
https://tracker.ceph.com/issues/42971

I've also summarized the offline repro steps and the workaround we're using at the bottom of this message.

Bryan

> On Nov 22, 2019, at 6:34 AM, Eugen Block <ebl...@nde.ag> wrote:
>
> Hi,
>
> we have also been facing some problems with the MGR; we had to switch off
> the balancer and pg_autoscaler because the active MGR would end up using a
> whole CPU, resulting in a hanging dashboard and hanging ceph commands. There
> are several similar threads on the ML, e.g. [1] and [2].
>
> I'm not aware of a solution yet, so I'll stick with the balancer disabled
> for now since the current pg placement is fine.
>
> Regards,
> Eugen
>
>
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg56994.html
> [2] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg56890.html
>
> Quoting Bryan Stillwell <bstillw...@godaddy.com>:
>
>> On multiple clusters we are seeing the mgr hang frequently when the
>> balancer is enabled. It seems that the balancer is getting caught
>> in some kind of infinite loop which chews up all the CPU for the mgr,
>> which causes problems with other modules like prometheus (we don't
>> have the devicehealth module enabled yet).
>>
>> I've been able to reproduce the issue doing an offline balance as
>> well using the osdmaptool:
>>
>> osdmaptool --debug-osd 10 osd.map --upmap balance-upmaps.sh
>>     --upmap-pool default.rgw.buckets.data --upmap-max 100
>>
>> It seems to loop over the same group of ~7,000 PGs over and
>> over again like this without finding any new upmaps that can be added:
>>
>> 2019-11-19 16:39:11.131518 7f85a156f300 10 trying 24.d91
>> 2019-11-19 16:39:11.138035 7f85a156f300 10 trying 24.2e3c
>> 2019-11-19 16:39:11.144162 7f85a156f300 10 trying 24.176b
>> 2019-11-19 16:39:11.149671 7f85a156f300 10 trying 24.ac6
>> 2019-11-19 16:39:11.155115 7f85a156f300 10 trying 24.2cb2
>> 2019-11-19 16:39:11.160508 7f85a156f300 10 trying 24.129c
>> 2019-11-19 16:39:11.166287 7f85a156f300 10 trying 24.181f
>> 2019-11-19 16:39:11.171737 7f85a156f300 10 trying 24.3cb1
>> 2019-11-19 16:39:11.177260 7f85a156f300 10 24.2177 already has pg_upmap_items [368,271]
>> 2019-11-19 16:39:11.177268 7f85a156f300 10 trying 24.2177
>> 2019-11-19 16:39:11.182590 7f85a156f300 10 trying 24.a4
>> 2019-11-19 16:39:11.188053 7f85a156f300 10 trying 24.2583
>> 2019-11-19 16:39:11.193545 7f85a156f300 10 24.93e already has pg_upmap_items [80,27]
>> 2019-11-19 16:39:11.193553 7f85a156f300 10 trying 24.93e
>> 2019-11-19 16:39:11.198858 7f85a156f300 10 trying 24.e67
>> 2019-11-19 16:39:11.204224 7f85a156f300 10 trying 24.16d9
>> 2019-11-19 16:39:11.209844 7f85a156f300 10 trying 24.11dc
>> 2019-11-19 16:39:11.215303 7f85a156f300 10 trying 24.1f3d
>> 2019-11-19 16:39:11.221074 7f85a156f300 10 trying 24.2a57
>>
>> While this cluster is running Luminous (12.2.12), I've reproduced
>> the loop using the same osdmap on Nautilus (14.2.4). Is there
>> somewhere I can privately upload the osdmap for someone to
>> troubleshoot the problem?
>>
>> Thanks,
>> Bryan
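
To summarize for anyone who finds this thread in the archives: the offline reproduction boils down to exporting the cluster's osdmap and running the balancer's upmap optimization through osdmaptool. The pool name below is from our cluster, and 'ceph osd getmap' is just the usual way to export the map (not something from the messages above), so adjust both for your environment:

    # export the current osdmap from the cluster
    ceph osd getmap -o osd.map

    # run the upmap optimization offline with balancer debug output;
    # this is where the loop over the same ~7,000 PGs shows up
    osdmaptool --debug-osd 10 osd.map --upmap balance-upmaps.sh \
        --upmap-pool default.rgw.buckets.data --upmap-max 100

Until there's a fix, the workaround Eugen described is to switch the balancer off (and disable the pg_autoscaler module if you're on Nautilus and have it enabled); something like:

    ceph balancer off
    ceph mgr module disable pg_autoscaler    # Nautilus and later only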