Try running gstack on the ceph-mgr process while it is frozen?
This could be a name resolution problem, as you suspect: gstack should show
where the process is 'stuck', and that might turn out to be a call to your
name resolution service.
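
For example, something along these lines (assuming gdb's gstack wrapper is
installed and the active mgr runs on the host you are logged into; the pgrep
lookup is my assumption, adjust to taste):

  # dump a stack trace of every thread in the running ceph-mgr
  gstack $(pgrep -x ceph-mgr)

Run it a few times while iostat is stalled: if the same frame keeps
appearing, e.g. a getaddrinfo() call, that would point at name resolution.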

On Tue, 27 Aug 2019 at 14:25, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:

> Whoops, I'm running Scientific Linux 7.6, going to upgrade to 7.7 soon...
>
> thanks
>
> Jake
>
>
> On 8/27/19 2:22 PM, Jake Grimmett wrote:
> > Hi Reed,
> >
> > That exactly matches what I'm seeing:
> >
> > when iostat is working OK, I see ~5% CPU use by ceph-mgr
> > and when iostat freezes, ceph-mgr CPU increases to 100%
> >
> > regarding OS, I'm using Scientific Linux 7.7
> > Kernel 3.10.0-957.21.3.el7.x86_64
> >
> > I'm not sure if the mgr initiates scrubbing, but if so, this could be
> > the cause of the "HEALTH_WARN 20 pgs not deep-scrubbed in time" that we
> > see.
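> >
> > (If useful, "ceph health detail" will list the affected pgs, and
> > "ceph pg deep-scrub <pgid>" should kick off a manual deep-scrub on one
> > of them.)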
> >
> > Anyhow, many thanks for your input, please let me know if you have
> > further ideas :)
> >
> > best,
> >
> > Jake
> >
> > On 8/27/19 2:01 PM, Reed Dier wrote:
> >> Curious what dist you're running on, as I've been having similar issues
> >> with instability in the mgr as well; curious if there are any similar
> >> threads to pull at.
> >>
> >> While the iostat command is running, is the active mgr using 100% CPU
> >> in top?
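> >>
> >> Something quick like this should show it (assuming the active mgr is
> >> local; the pgrep pattern is an assumption):
> >>
> >>   top -b -n 1 -p $(pgrep -x ceph-mgr)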
> >>
> >> Reed
> >>
> >>> On Aug 27, 2019, at 6:41 AM, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:
> >>>
> >>> Dear All,
> >>>
> >>> We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40
> >>> nodes.
> >>>
> >>> Unfortunately "ceph iostat" spends most of its time frozen, with
> >>> occasional periods of working normally for less than a minute; it then
> >>> freezes again for a couple of minutes, comes back to life, and so
> >>> on...
> >>>
> >>> No errors are seen on screen, unless I press CTRL+C when iostat is
> >>> stalled:
> >>>
> >>> [root@ceph-s3 ~]# ceph iostat
> >>> ^CInterrupted
> >>> Traceback (most recent call last):
> >>>  File "/usr/bin/ceph", line 1263, in <module>
> >>>    retval = main()
> >>>  File "/usr/bin/ceph", line 1194, in main
> >>>    verbose)
> >>>  File "/usr/bin/ceph", line 619, in new_style_command
> >>>    ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
> >>> sigdict, inbuf, verbose)
> >>>  File "/usr/bin/ceph", line 593, in do_command
> >>>    return ret, '', ''
> >>> UnboundLocalError: local variable 'ret' referenced before assignment
> >>>
> >>> Observations:
> >>>
> >>> 1) This problem does not seem to be related to load on the cluster.
> >>>
> >>> 2) When iostat is stalled, the dashboard is also non-responsive; when
> >>> iostat is working, the dashboard also works.
> >>>
> >>> Presumably the iostat and dashboard problems are due to the same
> >>> underlying fault? Perhaps a problem with the mgr?
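> >>>
> >>> (One way to test that might be to fail over to a standby mgr and see
> >>> if the symptoms follow the active daemon, e.g. "ceph mgr fail <name>",
> >>> where <name> is whatever "ceph mgr dump" reports as active_name.)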
> >>>
> >>>
> >>> 3) With iostat working, tailing /var/log/ceph/ceph-mgr.ceph-s3.log
> >>> shows:
> >>>
> >>> 2019-08-27 09:09:56.817 7f8149834700  0 log_channel(audit) log [DBG] :
> >>> from='client.4120202 -' entity='client.admin' cmd=[{"width": 95,
> >>> "prefix": "iostat", "poll": true, "target": ["mgr", ""],
> >>> "print_header": false}]: dispatch
> >>>
> >>> 4) When iostat isn't working, we see no obvious errors in the mgr log.
> >>>
> >>> 5) When the dashboard is not working, mgr log sometimes shows:
> >>>
> >>> 2019-08-27 09:18:18.810 7f813e533700  0 mgr[dashboard]
> >>> [::ffff:10.91.192.36:43606] [GET] [500] [2.724s] [jake] [1.6K]
> >>> /api/health/minimal
> >>> 2019-08-27 09:18:18.887 7f813e533700  0 mgr[dashboard] ['{"status": "500
> >>> Internal Server Error", "version": "3.2.2", "detail": "The server
> >>> encountered an unexpected condition which prevented it from fulfilling
> >>> the request.", "traceback": "Traceback (most recent call last):\\n  File
> >>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py\\", line 656,
> >>> in respond\\n    response.body = self.handler()\\n  File
> >>> \\"/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py\\", line
> >>> 188, in __call__\\n    self.body = self.oldhandler(*args, **kwargs)\\n
> >>> File \\"/usr/lib/python2.7/site-packages/cherrypy/_cptools.py\\", line
> >>> 221, in wrap\\n    return self.newhandler(innerfunc, *args, **kwargs)\\n
> >>> File \\"/usr/share/ceph/mgr/dashboard/services/exception.py\\", line
> >>> 88, in dashboard_exception_handler\\n    return handler(*args,
> >>> **kwargs)\\n  File
> >>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py\\", line 34,
> >>> in __call__\\n    return self.callable(*self.args, **self.kwargs)\\n
> >>> File \\"/usr/share/ceph/mgr/dashboard/controllers/__init__.py\\", line
> >>> 649, in inner\\n    ret = func(*args, **kwargs)\\n  File
> >>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 192, in
> >>> minimal\\n    return self.health_minimal.all_health()\\n  File
> >>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 51, in
> >>> all_health\\n    result[\'pools\'] = self.pools()\\n  File
> >>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 167, in
> >>> pools\\n    pools = CephService.get_pool_list_with_stats()\\n  File
> >>> \\"/usr/share/ceph/mgr/dashboard/services/ceph_service.py\\", line 124,
> >>> in get_pool_list_with_stats\\n    \'series\': [i for i in
> >>> stat_series]\\nRuntimeError: deque mutated during iteration\\n"}']
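> >>>
> >>> (The final error is a generic Python failure mode: a collections.deque
> >>> being mutated, presumably by another mgr thread, while the dashboard
> >>> iterates over it. A minimal sketch that reproduces the same message:
> >>>
> >>>   from collections import deque
> >>>
> >>>   d = deque([1, 2, 3])
> >>>   for x in d:
> >>>       d.append(x)  # RuntimeError: deque mutated during iteration
> >>>
> >>> i.e. the 'series' comprehension in get_pool_list_with_stats appears to
> >>> race with whatever populates stat_series.)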
> >>>
> >>>
> >>> 6) IPV6 is normally disabled on our machines at the kernel level, via
> >>> grubby --update-kernel=ALL --args="ipv6.disable=1"
> >>>
> >>> However, disabling ipv6 interfered with the dashboard (giving
> >>> "HEALTH_ERR Module 'dashboard' has failed: error('No socket could be
> >>> created',)"), so we re-enabled ipv6 on the mgr nodes only to fix this.
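> >>>
> >>> (For reference, re-enabling is just the reverse of the above, i.e.
> >>> something like
> >>>   grubby --update-kernel=ALL --remove-args="ipv6.disable=1"
> >>> followed by a reboot.)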
> >>>
> >>>
> >>> Ideas...?
> >>>
> >>> Should ipv6 be enabled, even if not configured, on all ceph nodes?
> >>>
> >>> Any ideas on fixing this gratefully received!
> >>>
> >>> many thanks
> >>>
> >>> Jake
> >>>
> >>> --
> >>> MRC Laboratory of Molecular Biology
> >>> Francis Crick Avenue,
> >>> Cambridge CB2 0QH, UK.
> >>>
> >>
> >
> >
>
>
> --
>
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
