My current working theory is that the mgr is getting hung up when it tries to scrape the device metrics from the mon. The 'tell' mechanism used to send mon-targeted commands is pretty kludgey/broken in nautilus and earlier. It's been rewritten for octopus, but isn't worth backporting--it never really caused problems until the devicehealth module started using it heavily.
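If you want to poke at that theory directly, you should be able to trigger a single scrape by hand and see whether the mgr wedges right away (command names are from the devicehealth module as I remember them; please double-check against 'ceph device -h' on 14.2.4, and note this will likely hang the active mgr again if the theory is right):

  # ask the devicehealth module to scrape one daemon's devices on demand;
  # pointing it at a mon should exercise the suspect 'tell' path
  ceph device scrape-daemon-health-metrics mon.mon001

  # for comparison, scraping an OSD's devices should come back quickly
  ceph device scrape-daemon-health-metrics osd.11

If the mon scrape stalls while the osd one returns promptly, that would pretty much confirm it.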
In any case, this PR just disables scraping of mon devices for nautilus:
https://github.com/ceph/ceph/pull/31446

There is a build queued at
https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/d592e56e
which should get packages in 1-2 hours. Perhaps you can install that
package on the mgr host and try to reproduce the problem again?

I noticed a few other oddities in the logs while looking through them,
like https://tracker.ceph.com/issues/42666, which will hopefully have a
fix ready for 14.2.5. I'm not sure about that auth error message, though!

sage

On Sat, 2 Nov 2019, Oliver Freyermuth wrote:

> Dear Sage,
>
> good news - it happened again, with debug logs!
> There's nothing obvious to my eye, it's uploaded as:
> 0b2d0c09-46f3-4126-aa27-e2d2e8572741
> It seems the failure was roughly in parallel to me wanting to access the
> dashboard. It must have happened within the last ~5-10 minutes of the log.
>
> I'll now go back to "stable operation", in case you need anything else, just
> let me know.
>
> Cheers and all the best,
> Oliver
>
> On 02.11.19 at 17:38, Oliver Freyermuth wrote:
> > Dear Sage,
> >
> > at least for the simple case:
> > ceph device get-health-metrics osd.11
> > => mgr crashes (but in that case, it crashes fully, i.e. the process is
> > gone)
> > I have now uploaded a verbose log as:
> > ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
> >
> > One potential cause of this (and maybe the other issues) might be that
> > some of our OSDs are on non-JBOD controllers and hence are made by forming
> > a RAID 0 per disk,
> > so a simple smartctl on the device will not work (but -dmegaraid,<number>
> > would be needed).
> >
> > Now I have both mgrs active again, debug logging on, device health metrics
> > on again,
> > and am waiting for them to become silent again. Let's hope the issue
> > reappears before the disks run full of logs ;-).
> >
> > Cheers,
> > Oliver
> >
> > On 02.11.19 at 02:56, Sage Weil wrote:
> >> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
> >>> Dear Cephers,
> >>>
> >>> interestingly, after:
> >>> ceph device monitoring off
> >>> the mgrs seem to be stable now - the active one still went silent a few
> >>> minutes later,
> >>> but the standby took over and was stable, and restarting the broken one,
> >>> it's now stable since an hour, too,
> >>> so probably, a restart of the mgr is needed after disabling device
> >>> monitoring to get things stable again.
> >>>
> >>> So it seems to be caused by a problem with the device health metrics. In
> >>> case this is a red herring and the mgrs become unstable again in the next
> >>> days,
> >>> I'll let you know.
> >>
> >> If this seems to stabilize things, and you can tolerate inducing the
> >> failure again, reproducing the problem with mgr logs cranked up (debug_mgr
> >> = 20, debug_ms = 1) would probably give us a good idea of why the mgr is
> >> hanging. Let us know!
> >>
> >> Thanks,
> >> sage
> >>
> >>> Cheers,
> >>> Oliver
> >>>
> >>> On 01.11.19 at 23:09, Oliver Freyermuth wrote:
> >>>> Dear Cephers,
> >>>>
> >>>> this is a 14.2.4 cluster with device health metrics enabled - since
> >>>> about a day, all mgr daemons go "silent" on me after a few hours, i.e.
> >>>> "ceph -s" shows:
> >>>>
> >>>>   cluster:
> >>>>     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
> >>>>     health: HEALTH_WARN
> >>>>             no active mgr
> >>>>             1/3 mons down, quorum mon001,mon002
> >>>>
> >>>>   services:
> >>>>     mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
> >>>>     mgr: no daemons active (since 56m)
> >>>>     ...
> >>>> (the third mon has a planned outage and will come back in a few days)
> >>>>
> >>>> Checking the logs of the mgr daemons, I find some "reset" messages at
> >>>> the time when it goes "silent", first for the first mgr:
> >>>>
> >>>> 2019-11-01 21:34:40.286 7f2df6a6b700 0 log_channel(cluster) log [DBG] :
> >>>> pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used,
> >>>> 136 TiB / 138 TiB avail
> >>>> 2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0 ms_handle_reset on
> >>>> v2:10.160.16.1:6800/401248
> >>>> 2019-11-01 21:34:42.287 7f2df6a6b700 0 log_channel(cluster) log [DBG] :
> >>>> pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used,
> >>>> 136 TiB / 138 TiB avail
> >>>>
> >>>> and a bit later, on the standby mgr:
> >>>>
> >>>> 2019-11-01 22:18:14.892 7f7bcc8ae700 0 log_channel(cluster) log [DBG] :
> >>>> pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858
> >>>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB
> >>>> used, 136 TiB / 138 TiB avail
> >>>> 2019-11-01 22:18:16.022 7f7be9e72700 0 client.0 ms_handle_reset on
> >>>> v2:10.160.16.2:6800/352196
> >>>> 2019-11-01 22:18:16.893 7f7bcc8ae700 0 log_channel(cluster) log [DBG] :
> >>>> pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858
> >>>> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB
> >>>> used, 136 TiB / 138 TiB avail
> >>>>
> >>>> Interestingly, the dashboard still works, but presents outdated
> >>>> information, and for example zero I/O going on.
> >>>> I believe this started to happen mainly after the third mon went into
> >>>> the known downtime, but I am not fully sure if this was the trigger,
> >>>> since the cluster is still growing.
> >>>> It may also have been the addition of 24 more OSDs.
> >>>>
> >>>> I also find other messages in the mgr logs which seem problematic, but I
> >>>> am not sure they are related:
> >>>> ------------------------------
> >>>> 2019-11-01 21:17:09.849 7f2df4266700 0 mgr[devicehealth] Error reading
> >>>> OMAP: [errno 22] Failed to operate read op for oid
> >>>> Traceback (most recent call last):
> >>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
> >>>>     ioctx.operate_read_op(op, devid)
> >>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func
> >>>> (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
> >>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op
> >>>> (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
> >>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid
> >>>> ------------------------------
> >>>> or:
> >>>> ------------------------------
> >>>> 2019-11-01 21:33:53.977 7f7bd38bc700 0 mgr[devicehealth] Fail to parse
> >>>> JSON result from daemon osd.51 ()
> >>>> 2019-11-01 21:33:53.978 7f7bd38bc700 0 mgr[devicehealth] Fail to parse
> >>>> JSON result from daemon osd.52 ()
> >>>> 2019-11-01 21:33:53.979 7f7bd38bc700 0 mgr[devicehealth] Fail to parse
> >>>> JSON result from daemon osd.53 ()
> >>>> ------------------------------
> >>>>
> >>>> The reason why I am cautious about the health metrics is that I observed
> >>>> a crash when trying to query them:
> >>>> ------------------------------
> >>>> 2019-11-01 20:21:23.661 7fa46314a700 0 log_channel(audit) log [DBG] :
> >>>> from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device
> >>>> get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
> >>>> 2019-11-01 20:21:23.661 7fa46394b700 0 mgr[devicehealth] handle_command
> >>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation
> >>>> fault) **
> >>>> in thread 7fa46394b700 thread_name:mgr-fin
> >>>>
> >>>> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus
> >>>> (stable)
> >>>> 1: (()+0xf5f0) [0x7fa488cee5f0]
> >>>> 2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
> >>>> 3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>> 4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>> 5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>> 6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
> >>>> 7: (()+0x709c8) [0x7fa48ae479c8]
> >>>> 8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >>>> 9: (()+0x5aaa5) [0x7fa48ae31aa5]
> >>>> 10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >>>> 11: (()+0x4bb95) [0x7fa48ae22b95]
> >>>> 12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
> >>>> 13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
> >>>> 14: (()+0x16c23d) [0x55c3c204023d]
> >>>> 15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
> >>>> 16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
> >>>> 17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
> >>>> 18: (()+0x7e65) [0x7fa488ce6e65]
> >>>> 19: (clone()+0x6d) [0x7fa48799488d]
> >>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >>>> needed to interpret this.
> >>>> ------------------------------
> >>>>
> >>>> I have issued:
> >>>> ceph device monitoring off
> >>>> for now and will keep waiting to see if mgrs go silent again. If there
> >>>> are any better ideas or this issue is known, let me know.
> >>>>
> >>>> Cheers,
> >>>> Oliver

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io