On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
> Dear Cephers,
> 
> interestingly, after:
>  ceph device monitoring off
> the mgrs seem to be stable now - the active one still went silent a few 
> minutes later, but the standby took over and has stayed stable, and the 
> broken one has also been stable for an hour now since I restarted it. 
> So it seems a restart of the mgr is needed after disabling device 
> monitoring to get things stable again. 
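> Concretely, something along these lines (assuming systemd-managed daemons; 
> the mgr id is typically the short hostname, adjust as needed):
> 
>  ceph device monitoring off
>  systemctl restart ceph-mgr@$(hostname -s)   # run on each mgr host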
> 
> So it seems to be caused by a problem with the device health metrics. In case 
> this is a red herring and the mgrs become unstable again in the next days, 
> I'll let you know. 

If this seems to stabilize things, and you can tolerate inducing the 
failure again, reproducing the problem with mgr logs cranked up (debug_mgr 
= 20, debug_ms = 1) would probably give us a good idea of why the mgr is 
hanging.  Let us know!
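
For reference, a minimal way to do that is via the centralized config, 
e.g.

  ceph config set mgr debug_mgr 20
  ceph config set mgr debug_ms 1
  # reproduce the hang, grab the mgr log (typically
  # /var/log/ceph/ceph-mgr.<name>.log), then revert:
  ceph config rm mgr debug_mgr
  ceph config rm mgr debug_ms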

Thanks,
sage

> 
> Cheers,
>       Oliver
> 
> Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
> > Dear Cephers,
> > 
> > this is a 14.2.4 cluster with device health metrics enabled - since about a 
> > day ago, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" 
> > shows:
> > 
> >   cluster:
> >     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
> >     health: HEALTH_WARN
> >             no active mgr
> >             1/3 mons down, quorum mon001,mon002
> >  
> >   services:
> >     mon:        3 daemons, quorum mon001,mon002 (age 57m), out of quorum: 
> > mon003
> >     mgr:        no daemons active (since 56m)
> >     ...
> > (the third mon has a planned outage and will come back in a few days)
> > 
> > Checking the logs of the mgr daemons, I find some "reset" messages at the 
> > time each one goes "silent" - first on the first mgr:
> > 
> > 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
> > pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
> > TiB / 138 TiB avail
> > 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
> > v2:10.160.16.1:6800/401248
> > 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
> > pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
> > TiB / 138 TiB avail
> > 
> > and a bit later, on the standby mgr:
> > 
> > 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
> > pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 
> > active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
> > 136 TiB / 138 TiB avail
> > 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
> > v2:10.160.16.2:6800/352196
> > 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
> > pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 
> > active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
> > 136 TiB / 138 TiB avail
> > 
> > Interestingly, the dashboard still works, but presents outdated 
> > information - for example, it shows zero I/O going on. 
> > I believe this started to happen mainly after the third mon went into its 
> > known downtime, but I am not fully sure that this was the trigger, since 
> > the cluster is still growing. 
> > It may also have been the addition of 24 more OSDs. 
> > 
> > 
> > I also find other messages in the mgr logs which seem problematic, but I am 
> > not sure they are related:
> > ------------------------------
> > 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading 
> > OMAP: [errno 22] Failed to operate read op for oid 
> > Traceback (most recent call last):
> >   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in 
> > put_device_metrics
> >     ioctx.operate_read_op(op, devid)
> >   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func 
> > (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
> > D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
> >   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op 
> > (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
> > InvalidArgumentError: [errno 22] Failed to operate read op for oid 
> > ------------------------------
> > or:
> > ------------------------------
> > 2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse 
> > JSON result from daemon osd.51 ()
> > 2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse 
> > JSON result from daemon osd.52 ()
> > 2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse 
> > JSON result from daemon osd.53 ()
> > ------------------------------
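> > (In case it helps with debugging: as far as I understand, the devicehealth 
> > module stores these metrics as OMAP on per-device objects in its own pool - 
> > "device_health_metrics" by default, if I am not mistaken - so what it has 
> > written so far can be inspected with plain rados, e.g.:
> >  rados -p device_health_metrics ls
> >  rados -p device_health_metrics listomapkeys <device-id-from-ls>
> > )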
> > 
> > The reason why I am cautious about the health metrics is that I observed a 
> > crash when trying to query them:
> > ------------------------------
> > 2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] : 
> > from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device 
> > get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
> > 2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
> > 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation 
> > fault) **
> >  in thread 7fa46394b700 thread_name:mgr-fin
> > 
> >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > (stable)
> >  1: (()+0xf5f0) [0x7fa488cee5f0]
> >  2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
> >  3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >  4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >  5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >  6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
> >  7: (()+0x709c8) [0x7fa48ae479c8]
> >  8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >  9: (()+0x5aaa5) [0x7fa48ae31aa5]
> >  10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >  11: (()+0x4bb95) [0x7fa48ae22b95]
> >  12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
> >  13: (ActivePyModule::handle_command(std::map<std::string, 
> > boost::variant<std::string, bool, long, double, std::vector<std::string, 
> > std::allocator<std::string> >, std::vector<long, std::allocator<long> >, 
> > std::vector<double, std::allocator<double> > >, std::less<void>, 
> > std::allocator<std::pair<std::string const, boost::variant<std::string, 
> > bool, long, double, std::vector<std::string, std::allocator<std::string> >, 
> > std::vector<long, std::allocator<long> >, std::vector<double, 
> > std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list 
> > const&, std::basic_stringstream<char, std::char_traits<char>, 
> > std::allocator<char> >*, std::basic_stringstream<char, 
> > std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
> >  14: (()+0x16c23d) [0x55c3c204023d]
> >  15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
> >  16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
> >  17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
> >  18: (()+0x7e65) [0x7fa488ce6e65]
> >  19: (clone()+0x6d) [0x7fa48799488d]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed 
> > to interpret this.
> > ------------------------------
> > 
> > I have issued:
> > ceph device monitoring off
> > for now and will keep watching to see whether the mgrs go silent again. If 
> > there are any better ideas, or if this issue is already known, let me know. 
> > 
> > Cheers,
> >     Oliver
> > 
> > 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
