Hi, I am using a Ceph Nautilus cluster, and one of my OSD nodes, which runs three OSD services, suddenly went down; the system was so slow that even typing commands took a long time. I killed the ceph-osd processes, the system became normal again, and I started all the OSD services.
After the node became normal again, I figured out that the system was slow due to low memory, but there is no OOM kill in any of the logs (see the checks sketched after the logs below). I can see heartbeat_check and suicide timeout messages in the OSD logs below, but why did this happen? What could be the issue?

*osd.1*
2019-10-16 20:49:38.433 7f246f302700 0 bluestore(/var/lib/ceph/osd/ceph-1) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.20965s, txc = 0x55a2541f1500
2019-10-16 20:50:05.233 7f247bf7d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f245e4e6700' had timed out after 15
2019-10-16 20:50:06.721 7f247af7b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f245e4e6700' had timed out after 15
2019-10-16 20:53:57.881 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019-10-16 20:49:44.711088 front 2019-10-16 20:49:47.684146 (oldest deadline 2019-10-16 20:50:07.937387)
2019-10-16 20:55:26.997 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019-10-16 20:49:44.711031 front 2019-10-16 20:49:44.711495 (oldest deadline 2019-10-16 20:50:07.937387)
2019-10-16 20:56:25.513 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.112:6805 osd.4 since back 2019-10-16 20:49:44.484031 front 2019-10-16 20:49:44.711337 (oldest deadline 2019-10-16 20:50:03.162338)
2019-10-16 20:57:52.557 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019-10-16 20:49:44.484064 front 2019-10-16 20:49:44.711504 (oldest deadline 2019-10-16 20:50:03.162338)
2019-10-16 20:59:22.837 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019-10-16 20:49:44.711483 front 2019-10-16 20:49:44.483337 (oldest deadline 2019-10-16 20:50:03.162338)
2019-10-16 21:01:18.761 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.112:6821 osd.7 since back 2019-10-16 20:49:44.711275 front 2019-10-16 20:49:44.711352 (oldest deadline 2019-10-16 20:50:07.937387)
2019-10-16 21:02:34.021 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019-10-16 20:49:44.711517 front 2019-10-16 20:49:44.711470 (oldest deadline 2019-10-16 20:50:07.937387)
2019-10-16 21:04:26.361 7f247b77c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f245d4e4700' had timed out after 60
2019-10-16 21:04:46.225 7f247b77c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f245e4e6700' had timed out after 15
2019-10-16 21:05:01.453 7f247b77c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f245e4e6700' had suicide timed out after 150
2019-10-16 21:34:06.365 7f0a6105cf80 0 set uid:gid to 64045:64045 (ceph:ceph)
2019-10-16 21:34:06.365 7f0a6105cf80 0 ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable), process ceph-osd, pid 203416
2019-10-16 21:34:06.365 7f0a6105cf80 0 pidfile_write: ignore empty --pid-file

*osd.4*
2019-10-16 20:53:57.881 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.112:6813 osd.1 since back 2019-10-16 20:49:41.268550 front 2019-10-16 20:49:41.268546 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 20:55:26.997 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019-10-16 20:49:44.484184 front 2019-10-16 20:49:44.483961 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 20:56:25.513 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019-10-16 20:49:41.192151 front 2019-10-16 20:49:41.191856 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 20:57:52.557 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019-10-16 20:49:44.484278 front 2019-10-16 20:49:47.920080 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 20:59:22.837 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019-10-16 20:49:41.268846 front 2019-10-16 20:49:41.192139 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 21:01:16.561 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.112:6821 osd.7 since back 2019-10-16 20:49:41.380152 front 2019-10-16 20:49:41.380178 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 21:02:15.621 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019-10-16 20:49:43.862767 front 2019-10-16 20:49:44.484153 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 21:04:14.997 7f7efba77700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7edd7df700' had timed out after 60
2019-10-16 21:04:31.689 7f7efba77700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7eddfe0700' had timed out after 60
2019-10-16 21:04:56.025 7f7efba77700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7ede7e1700' had timed out after 60
2019-10-16 21:05:10.409 7f7efc278700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7edd7df700' had timed out after 60
2019-10-16 21:05:20.813 7f7efc278700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7eddfe0700' had timed out after 60
2019-10-16 21:05:34.149 7f7efc278700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7ede7e1700' had timed out after 60
2019-10-16 21:05:48.465 7f7efb276700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7edd7df700' had timed out after 60
2019-10-16 21:06:15.409 7f7efb276700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7eddfe0700' had timed out after 60
2019-10-16 21:06:32.173 7f7efb276700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7ede7e1700' had timed out after 60
2019-10-16 21:21:53.125 7f7ee87f5700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
2019-10-16 21:22:15.137 7f7ee87f5700 4 rocksdb: [db/db_impl.cc:778]

*osd.7*
2019-10-16 20:53:57.881 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.112:6813 osd.1 since back 2019-10-16 20:49:41.380241 front 2019-10-16 20:49:41.380043 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 20:55:26.997 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019-10-16 20:49:44.484063 front 2019-10-16 20:49:47.920022 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 20:56:25.513 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019-10-16 20:49:41.379978 front 2019-10-16 20:49:41.380217 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 20:57:49.685 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.112:6805 osd.4 since back 2019-10-16 20:49:41.380202 front 2019-10-16 20:49:41.380119 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 20:59:18.389 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019-10-16 20:49:43.862826 front 2019-10-16 20:49:43.723800 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 21:01:16.561 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019-10-16 20:49:41.380229 front 2019-10-16 20:49:41.380023 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 21:02:19.197 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019-10-16 20:49:44.484017 front 2019-10-16 20:49:44.484078 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 21:21:53.125 7f959cbe1700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
2019-10-16 21:22:20.221 7f959cbe1700 4 rocksdb: [db/db_impl.cc:778]

*Journalctl - output*
Oct 16 20:53:38 test-node2 systemd-resolved[1005]: Using degraded feature set (TCP) for DNS server 192.168.4.12.
Oct 16 20:56:02 test-node2 ceph-osd[1710]: 2019-10-16 20:53:57.881 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019-
Oct 16 20:56:29 test-node2 ceph-osd[91197]: 2019-10-16 20:53:57.881 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.112:6813 osd.1 since back 2019
Oct 16 20:53:40 test-node2 systemd-resolved[1005]: Using degraded feature set (TCP) for DNS server 192.168.4.7.
Oct 16 20:57:17 test-node2 ceph-osd[1713]: 2019-10-16 20:53:57.881 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.112:6813 osd.1 since back 2019-
Oct 16 20:57:17 test-node2 ceph-osd[1713]: 2019-10-16 20:55:26.997 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019-
Oct 16 20:57:59 test-node2 ceph-osd[1710]: 2019-10-16 20:55:26.997 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019-
Oct 16 20:57:59 test-node2 ceph-osd[1710]: 2019-10-16 20:56:25.513 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.112:6805 osd.4 since back 2019-
Oct 16 20:58:43 test-node2 ceph-osd[91197]: 2019-10-16 20:55:26.997 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019
Oct 16 20:58:43 test-node2 ceph-osd[91197]: 2019-10-16 20:56:25.513 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019
Oct 16 20:53:51 test-node2 systemd-networkd[907]: enp5s0: Failed to save LLDP data to /run/systemd/netif/lldp/2: No space left on device
Oct 16 21:00:13 test-node2 ceph-osd[1713]: 2019-10-16 20:56:25.513 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019-
Oct 16 21:00:13 test-node2 ceph-osd[1713]: 2019-10-16 20:57:52.557 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019-
Oct 16 21:01:42 test-node2 ceph-osd[1710]: 2019-10-16 20:57:52.557 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019-
Oct 16 21:01:42 test-node2 ceph-osd[1710]: 2019-10-16 20:59:22.837 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019-
Oct 16 21:02:33 test-node2 ceph-osd[91197]: 2019-10-16 20:57:49.685 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.112:6805 osd.4 since back 2019
Oct 16 21:02:33 test-node2 ceph-osd[91197]: 2019-10-16 20:59:18.389 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019
Oct 16 20:54:24 test-node2 systemd-networkd[907]: enp5s0: Failed to save LLDP data to /run/systemd/netif/lldp/2: No space left on device
Oct 16 21:04:08 test-node2 ceph-osd[1713]: 2019-10-16 20:59:22.837 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019-
Oct 16 21:04:08 test-node2 ceph-osd[1713]: 2019-10-16 21:01:16.561 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.112:6821 osd.7 since back 2019-
Oct 16 21:04:54 test-node2 ceph-osd[1710]: 2019-10-16 21:01:18.761 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.112:6821 osd.7 since back 2019-
Oct 16 21:04:54 test-node2 ceph-osd[1710]: 2019-10-16 21:02:34.021 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019-
Oct 16 21:05:55 test-node2 ceph-osd[91197]: 2019-10-16 21:01:16.561 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019
Oct 16 21:05:55 test-node2 ceph-osd[91197]: 2019-10-16 21:02:19.197 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019
Oct 16 20:54:56 test-node2 systemd-networkd[907]: enp5s0: Failed to save LLDP data to /run/systemd/netif/lldp/2: No space left on device
Oct 16 21:07:37 test-node2 ceph-osd[1713]: 2019-10-16 21:02:15.621 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019-
Oct 16 21:08:12 test-node2 ceph-osd[1710]: *** Caught signal (Aborted) **
Oct 16 21:08:12 test-node2 ceph-osd[1710]: in thread 7f245e4e6700 thread_name:tp_osd_tp
Oct 16 21:11:49 test-node2 ceph-osd[1710]: /build/ceph-14.2.4/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*,
Oct 16 21:11:49 test-node2 ceph-osd[1710]: /build/ceph-14.2.4/src/common/HeartbeatMap.cc: 82: ceph_abort_msg("hit suicide timeout")
Oct 16 21:12:36 test-node2 ceph-osd[1710]: ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
Oct 16 21:12:36 test-node2 ceph-osd[1710]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocato
Oct 16 21:34:05 test-node2 systemd[1]: systemd-journald.service: Main process exited, code=dumped, status=6/ABRT
Oct 16 21:34:05 test-node2 systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@1.service: Main process exited, code=killed, status=6/ABRT
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@1.service: Failed with result 'signal'.
Oct 16 21:34:05 test-node2 systemd[1]: systemd-journald.service: Service has no hold-off time, scheduling restart.
Oct 16 21:34:05 test-node2 systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 1.
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@4.service: Main process exited, code=killed, status=9/KILL
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@4.service: Failed with result 'signal'.
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@1.service: Service hold-off time over, scheduling restart.
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@1.service: Scheduled restart job, restart counter is at 1.
Oct 16 21:34:05 test-node2 systemd[1]: Started Ceph object storage daemon osd.4.
Oct 16 21:34:05 test-node2 systemd[1]: Started Ceph object storage daemon osd.1.
Oct 16 21:34:05 test-node2 systemd[1]: Started Ceph object storage daemon osd.7.
Oct 17 06:25:03 test-node2 ceph-osd[203415]: 2019-10-17 06:25:03.652 7f91880ca700 -1 Fail to open '/proc/204405/cmdline' error = (2) No such file or directory
Oct 17 06:25:03 test-node2 ceph-osd[203415]: 2019-10-17 06:25:03.680 7f91880ca700 -1 received signal: Hangup from <unknown> (PID: 204405) UID: 0
Oct 17 06:25:03 test-node2 ceph-osd[203415]: 2019-10-17 06:25:03.736 7f91880ca700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-
Oct 17 06:25:04 test-node2 ceph-osd[203417]: 2019-10-17 06:25:03.652 7f2ce27ac700 -1 Fail to open '/proc/204405/cmdline' error = (2) No such file or directory
Oct 17 06:25:04 test-node2 ceph-osd[203417]: 2019-10-17 06:25:03.692 7f2ce27ac700 -1 received signal: Hangup from <unknown> (PID: 204405) UID: 0
Oct 17 06:25:04 test-node2 ceph-osd[203417]: 2019-10-17 06:25:03.736 7f2ce27ac700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-
Oct 17 06:25:04 test-node2 ceph-osd[203416]: 2019-10-17 06:25:03.960 7f0a577ab700 -1 Fail to open '/proc/204406/cmdline' error = (2) No such file or directory
Oct 17 06:25:04 test-node2 ceph-osd[203416]: 2019-10-17 06:25:03.960 7f0a577ab700 -1 received signal: Hangup from <unknown> (PID: 204406) UID: 0
Oct 17 06:25:04 test-node2 ceph-osd[203416]: 2019-10-17 06:25:03.960 7f0a577ab700 -1 Fail to open '/proc/204406/cmdline' error = (2) No such file or directory
Oct 17 06:25:04 test-node2 ceph-osd[203416]: 2019-10-17 06:25:03.960 7f0a577ab700 -1 received signal: Hangup from <unknown> (PID: 204406) UID: 0
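For reference, these are roughly the checks I ran (or still plan to run) on the node when looking for an OOM kill and for the memory settings of the OSDs. This is only a sketch: osd.1 is just an example ID, the time window matches my incident, and the exact options may need adjusting on your systems:

# OOM killer activity in the kernel log around the incident window
journalctl -k --since "2019-10-16 20:00" --until "2019-10-16 22:00" | grep -iE 'out of memory|oom'
dmesg -T | grep -iE 'killed process|out of memory'

# current memory situation on the node
free -h

# per-OSD memory accounting via the admin socket (run on the OSD node)
ceph daemon osd.1 dump_mempools

# memory target and the thread timeouts that show up in the logs above
# (15s/60s op thread timeouts, 150s suicide timeout)
ceph config get osd.1 osd_memory_target
ceph config get osd.1 osd_op_thread_timeout
ceph config get osd.1 osd_op_thread_suicide_timeout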