Try running gstack on the ceph-mgr process when it is frozen. This could be a name resolution problem, as you suspect; gstack may show where the process is 'stuck', and that might turn out to be a call to your name resolution service.
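If catching the mgr at exactly the right moment by hand is awkward, a small watcher can poll its CPU use and run gstack automatically when it spins up to 100%. A minimal sketch only, assuming Linux /proc, gstack installed, and that you pass the active mgr's pid (e.g. from pidof ceph-mgr); the function names here are made up for illustration:

```python
import os
import subprocess
import time

def cpu_jiffies(pid):
    # Fields 14/15 of /proc/<pid>/stat are utime/stime in clock ticks.
    # The comm field may contain spaces, so split after its closing paren.
    with open("/proc/%d/stat" % pid) as f:
        rest = f.read().rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])

def dump_stack_when_busy(pid, threshold=0.9, interval=5):
    """Poll CPU usage; when the process nears 100%, run gstack on it."""
    prev = cpu_jiffies(pid)
    while True:
        time.sleep(interval)
        cur = cpu_jiffies(pid)
        usage = (cur - prev) / float(os.sysconf("SC_CLK_TCK") * interval)
        prev = cur
        if usage >= threshold:
            # gstack attaches gdb briefly and prints per-thread backtraces
            print(subprocess.check_output(["gstack", str(pid)]).decode())
```

Threads blocked in a slow DNS/NSS call should show up in the backtrace as getaddrinfo or similar.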
On Tue, 27 Aug 2019 at 14:25, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:
> Whoops, I'm running Scientific Linux 7.6, going to upgrade to 7.7 soon...
>
> thanks
>
> Jake
>
> On 8/27/19 2:22 PM, Jake Grimmett wrote:
> > Hi Reed,
> >
> > That exactly matches what I'm seeing:
> >
> > when iostat is working OK, I see ~5% CPU use by ceph-mgr,
> > and when iostat freezes, ceph-mgr CPU increases to 100%
> >
> > regarding OS, I'm using Scientific Linux 7.7
> > Kernel 3.10.0-957.21.3.el7.x86_64
> >
> > I'm not sure if the mgr initiates scrubbing, but if so, this could be
> > the cause of the "HEALTH_WARN 20 pgs not deep-scrubbed in time" that we see.
> >
> > Anyhow, many thanks for your input, please let me know if you have
> > further ideas :)
> >
> > best,
> >
> > Jake
> >
> > On 8/27/19 2:01 PM, Reed Dier wrote:
> >> Curious what dist you're running on, as I've been having similar issues
> >> with instability in the mgr as well; curious if there are any similar
> >> threads to pull at.
> >>
> >> While the iostat command is running, is the active mgr using 100% CPU
> >> in top?
> >>
> >> Reed
> >>
> >>> On Aug 27, 2019, at 6:41 AM, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:
> >>>
> >>> Dear All,
> >>>
> >>> We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over
> >>> 40 nodes.
> >>>
> >>> Unfortunately "ceph iostat" spends most of its time frozen, with
> >>> occasional periods of working normally for less than a minute; it then
> >>> freezes again for a couple of minutes, then comes back to life, and
> >>> so on...
> >>>
> >>> No errors are seen on screen, unless I press CTRL+C while iostat is
> >>> stalled:
> >>>
> >>> [root@ceph-s3 ~]# ceph iostat
> >>> ^CInterrupted
> >>> Traceback (most recent call last):
> >>>   File "/usr/bin/ceph", line 1263, in <module>
> >>>     retval = main()
> >>>   File "/usr/bin/ceph", line 1194, in main
> >>>     verbose)
> >>>   File "/usr/bin/ceph", line 619, in new_style_command
> >>>     ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
> >>>       sigdict, inbuf, verbose)
> >>>   File "/usr/bin/ceph", line 593, in do_command
> >>>     return ret, '', ''
> >>> UnboundLocalError: local variable 'ret' referenced before assignment
> >>>
> >>> Observations:
> >>>
> >>> 1) This problem does not seem to be related to load on the cluster.
> >>>
> >>> 2) When iostat is stalled, the dashboard is also non-responsive; when
> >>> iostat is working, the dashboard also works.
> >>>
> >>> Presumably the iostat and dashboard problems are due to the same
> >>> underlying fault? Perhaps a problem with the mgr?
> >>>
> >>> 3) With iostat working, tailing /var/log/ceph/ceph-mgr.ceph-s3.log
> >>> shows:
> >>>
> >>> 2019-08-27 09:09:56.817 7f8149834700 0 log_channel(audit) log [DBG] :
> >>> from='client.4120202 -' entity='client.admin' cmd=[{"width": 95,
> >>> "prefix": "iostat", "poll": true, "target": ["mgr", ""], "print_header":
> >>> false}]: dispatch
> >>>
> >>> 4) When iostat isn't working, we see no obvious errors in the mgr log.
> >>>
> >>> 5) When the dashboard is not working, the mgr log sometimes shows:
> >>>
> >>> 2019-08-27 09:18:18.810 7f813e533700 0 mgr[dashboard]
> >>> [::ffff:10.91.192.36:43606] [GET] [500] [2.724s] [jake] [1.6K]
> >>> /api/health/minimal
> >>> 2019-08-27 09:18:18.887 7f813e533700 0 mgr[dashboard] ['{"status": "500
> >>> Internal Server Error", "version": "3.2.2", "detail": "The server
> >>> encountered an unexpected condition which prevented it from fulfilling
> >>> the request.", "traceback": "Traceback (most recent call last):\\n File
> >>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py\\", line 656,
> >>> in respond\\n response.body = self.handler()\\n File
> >>> \\"/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py\\", line
> >>> 188, in __call__\\n self.body = self.oldhandler(*args, **kwargs)\\n
> >>> File \\"/usr/lib/python2.7/site-packages/cherrypy/_cptools.py\\", line
> >>> 221, in wrap\\n return self.newhandler(innerfunc, *args, **kwargs)\\n
> >>> File \\"/usr/share/ceph/mgr/dashboard/services/exception.py\\", line
> >>> 88, in dashboard_exception_handler\\n return handler(*args,
> >>> **kwargs)\\n File
> >>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py\\", line 34,
> >>> in __call__\\n return self.callable(*self.args, **self.kwargs)\\n
> >>> File \\"/usr/share/ceph/mgr/dashboard/controllers/__init__.py\\", line
> >>> 649, in inner\\n ret = func(*args, **kwargs)\\n File
> >>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 192, in
> >>> minimal\\n return self.health_minimal.all_health()\\n File
> >>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 51, in
> >>> all_health\\n result[\'pools\'] = self.pools()\\n File
> >>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 167, in
> >>> pools\\n pools = CephService.get_pool_list_with_stats()\\n File
> >>> \\"/usr/share/ceph/mgr/dashboard/services/ceph_service.py\\", line 124,
> >>> in get_pool_list_with_stats\\n \'series\': [i for i in
> >>> stat_series]\\nRuntimeError: deque mutated during iteration\\n"}']
> >>>
> >>> 6) IPv6 is normally disabled on our machines at the kernel level, via
> >>> grubby --update-kernel=ALL --args="ipv6.disable=1"
> >>>
> >>> However, disabling IPv6 interfered with the dashboard (giving
> >>> "HEALTH_ERR Module 'dashboard' has failed: error('No socket could be
> >>> created',)"), so we re-enabled IPv6 on the mgr nodes only to fix this.
> >>>
> >>> Ideas...?
> >>>
> >>> Should IPv6 be enabled, even if not configured, on all ceph nodes?
> >>>
> >>> Any ideas on fixing this gratefully received!
> >>>
> >>> many thanks
> >>>
> >>> Jake
> >>>
> >>> --
> >>> MRC Laboratory of Molecular Biology
> >>> Francis Crick Avenue,
> >>> Cambridge CB2 0QH, UK.
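For what it's worth, the "RuntimeError: deque mutated during iteration" in the dashboard traceback is the classic symptom of one thread iterating over a collections.deque while another thread appends to it. Here is a minimal single-threaded reproduction of the same error, plus one common mitigation (snapshotting under a lock); this is illustrative only, not the actual dashboard code:

```python
import threading
from collections import deque

# Reproduce the error: mutating a deque while iterating over it.
d = deque([1, 2, 3])
try:
    for i in d:
        d.append(i)  # in the mgr, another thread does the appending
except RuntimeError as e:
    print(e)  # -> deque mutated during iteration

# One common mitigation: copy the deque under a lock that writers
# also hold, then iterate over the private copy.
lock = threading.Lock()

def snapshot(series):
    with lock:
        return list(series)

stats = snapshot(d)  # safe to iterate; it is a plain list
```

The comprehension in get_pool_list_with_stats iterates stat_series directly, so a concurrent update by the stats-gathering thread would produce exactly the 500 error shown above.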
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com