Hi all,

Interestingly, now that the third mon has been missing for almost a week (those 
planned interventions always take longer than expected...),
we are seeing mgr failovers (but without crashes).

In the mgr log, I find:

2019-11-06 07:50:05.409 7fce8a0dc700  0 client.0 ms_handle_reset on 
v2:10.160.16.1:6800/618072
...
... the normal churning ...
...
2019-11-06 07:52:44.113 7fce8a0dc700 -1 mgr handle_mgr_map I was active but no 
longer am
2019-11-06 07:52:44.113 7fce8a0dc700  1 mgr respawn  e: '/usr/bin/ceph-mgr'

In the mon log, I see:
...
2019-11-06 07:44:11.565 7f1f44453700  4 rocksdb: [db/db_impl_files.cc:356] [JOB 
225] Try to delete WAL files size 10830909, prev total WAL file size 10839895, 
number of live WAL files 2.

2019-11-06 07:44:11.565 7f1f3a43f700  4 rocksdb: 
[db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2019-11-06 07:44:11.565 7f1f44c54700  4 rocksdb: (Original Log Time 
2019/11/06-07:44:11.565802) [db/db_impl_compaction_flush.cc:2374] [default] 
Manual compaction from level-0 to level-6 from 'mgrstat .. 'mgrstat; will stop 
at (end)
...
2019-11-06 07:50:36.734 7f1f3a43f700  4 rocksdb: 
[db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2019-11-06 07:52:27.046 7f1f4144d700  0 log_channel(cluster) log [INF] : 
Manager daemon mon001 is unresponsive, replacing it with standby daemon mon002
...

There's a lot of compaction going on (probably due to the prolonged HEALTH_WARN 
state, so not really unexpected),
so I wonder whether the heavy compaction on the mons is the actual cause of the 
mgr being flagged as "unresponsive".
It will be interesting to see what happens when we finally have the third mon 
back and the cluster becomes healthy again...
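
If the compaction theory holds, one possible mitigation (just a sketch, I have 
not tried it yet) might be to temporarily raise the grace period after which 
the mons declare a mgr unresponsive:

  # check the current grace period on the mons (default is 30 seconds)
  ceph config get mon mon_mgr_beacon_grace
  # temporarily raise it while the cluster is degraded and compacting heavily;
  # 90 seconds is just an arbitrary example value
  ceph config set mon mon_mgr_beacon_grace 90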

Has anybody seen something similar after running Nautilus for a week or more 
on old and slow hardware?

Cheers,
        Oliver

On 02.11.19 at 18:20, Oliver Freyermuth wrote:
Dear Sage,

good news - it happened again, with debug logs!
There's nothing obvious to my eye; the log is uploaded as:
0b2d0c09-46f3-4126-aa27-e2d2e8572741
It seems the failure roughly coincided with my attempt to access the dashboard. 
It must have happened within the last ~5-10 minutes of the log.

I'll now go back to "stable operation"; in case you need anything else, just 
let me know.

Cheers and all the best,
        Oliver

On 02.11.19 at 17:38, Oliver Freyermuth wrote:
Dear Sage,

at least for the simple case:
  ceph device get-health-metrics osd.11
=> the mgr crashes (though in that case, it crashes fully, i.e. the process is 
gone)
I have now uploaded a verbose log as:
ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e

One potential cause of this (and maybe of the other issues) might be that some 
of our OSDs sit behind non-JBOD controllers and hence are exposed as single-disk 
RAID 0 volumes,
so a plain smartctl on the device will not work (smartctl -d megaraid,<number> 
would be needed instead).
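
For reference, a manual check on such a host looks roughly like this (a sketch; 
the MegaRAID device number and block device path are of course host-specific):

  # a plain query fails for a disk hidden behind a single-disk RAID 0 volume
  smartctl -a /dev/sdb
  # addressing the physical disk via the MegaRAID driver instead usually works
  smartctl -a -d megaraid,0 /dev/sdb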

Now I have both mgrs active again, debug logging on, and device health metrics 
on again,
and am waiting for them to go silent again. Let's hope the issue reappears 
before the disks fill up with logs ;-).
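
(For completeness: one way to crank up the mgr logging at runtime is via the 
centralized config store, roughly like this - the values follow Sage's 
suggestion below:)

  # raise mgr verbosity at runtime; the overrides can be removed again later
  # with "ceph config rm mgr debug_mgr" etc. once the issue has been caught
  ceph config set mgr debug_mgr 20
  ceph config set mgr debug_ms 1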

Cheers,
        Oliver

On 02.11.19 at 02:56, Sage Weil wrote:
On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
Dear Cephers,

Interestingly, after:
  ceph device monitoring off
the mgrs seem to be stable now. The active one still went silent a few minutes 
later, but the standby took over and stayed stable, and after restarting the 
broken one, it has now been stable for an hour, too.
So it seems that a restart of the mgr is needed after disabling device 
monitoring to get things stable again.
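
(For completeness, restarting the hung mgr is done on its host; how exactly 
depends on the deployment - as a sketch, with mon001 as in our setup:)

  # restart the hung mgr daemon via systemd on its host
  systemctl restart ceph-mgr@mon001
  # or, alternatively, force a failover to the standby from any node
  ceph mgr fail mon001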

So it seems to be caused by a problem with the device health metrics. In case 
this is a red herring and the mgrs become unstable again in the next few days, 
I'll let you know.

If this seems to stabilize things, and you can tolerate inducing the
failure again, reproducing the problem with mgr logs cranked up (debug_mgr
= 20, debug_ms = 1) would probably give us a good idea of why the mgr is
hanging.  Let us know!

Thanks,
sage

Cheers,
        Oliver

On 01.11.19 at 23:09, Oliver Freyermuth wrote:
Dear Cephers,

this is a 14.2.4 cluster with device health metrics enabled - for about a day now, 
all mgr daemons have been going "silent" on me after a few hours, i.e. "ceph -s" shows:

   cluster:
     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
     health: HEALTH_WARN
             no active mgr
             1/3 mons down, quorum mon001,mon002

   services:
     mon:        3 daemons, quorum mon001,mon002 (age 57m), out of quorum: 
mon003
     mgr:        no daemons active (since 56m)
     ...
(the third mon has a planned outage and will come back in a few days)

Checking the logs of the mgr daemons, I find some "reset" messages at the time 
they go "silent", first on the first mgr:

2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap 
v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 
TiB avail
2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
v2:10.160.16.1:6800/401248
2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap 
v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 
TiB avail

and a bit later, on the standby mgr:

2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap 
v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 
active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
v2:10.160.16.2:6800/352196
2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap 
v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 
active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail

Interestingly, the dashboard still works, but presents outdated information, 
showing for example zero I/O going on.
I believe this mainly started to happen after the third mon went into its known 
downtime, but I am not fully sure that this was the trigger, since the cluster 
is still growing.
It may also have been the addition of 24 more OSDs.


I also find other messages in the mgr logs which seem problematic, but I am not 
sure they are related:
------------------------------
2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: 
[errno 22] Failed to operate read op for oid
Traceback (most recent call last):
   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in 
put_device_metrics
     ioctx.operate_read_op(op, devid)
   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func 
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op 
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
InvalidArgumentError: [errno 22] Failed to operate read op for oid
------------------------------
or:
------------------------------
2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
result from daemon osd.51 ()
2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
result from daemon osd.52 ()
2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
result from daemon osd.53 ()
------------------------------
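
To see what the OSDs actually hand back here, the smart data can be queried 
directly via the admin socket (a sketch; osd.51 is taken from the log above, 
and the command has to run on the host carrying that OSD):

  # ask the OSD itself for the smart data it reports to the mgr;
  # an empty result here would fit the empty "()" in the mgr log above
  ceph daemon osd.51 smart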

The reason why I am cautious about the health metrics is that I observed a 
crash when trying to query them:
------------------------------
2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": 
"device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) 
**
  in thread 7fa46394b700 thread_name:mgr-fin

  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
(stable)
  1: (()+0xf5f0) [0x7fa488cee5f0]
  2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
  3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
  4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
  5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
  6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
  7: (()+0x709c8) [0x7fa48ae479c8]
  8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
  9: (()+0x5aaa5) [0x7fa48ae31aa5]
  10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
  11: (()+0x4bb95) [0x7fa48ae22b95]
  12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
  13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> 
>, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, 
std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, 
std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
  14: (()+0x16c23d) [0x55c3c204023d]
  15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
  16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
  17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
  18: (()+0x7e65) [0x7fa488ce6e65]
  19: (clone()+0x6d) [0x7fa48799488d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.
------------------------------

I have issued:
ceph device monitoring off
for now and will keep waiting to see if mgrs go silent again. If there are any 
better ideas or this issue is known, let me know.

Cheers,
        Oliver


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
