On Thu, Oct 4, 2018 at 3:58 PM Stefan Kooman <ste...@bit.nl> wrote:
> Dear list,
>
> Today we hit our first Ceph MDS issue. Out of the blue the active MDS
> stopped working:
>
> mon.mon1 [WRN] daemon mds.mds1 is not responding, replacing it as rank 0 with standby daemon mds.mds2.
>
> Logging of ceph-mds1:
>
> 2018-10-04 10:50:08.524745 7fdd516bf700 1 mds.mds1 asok_command: status (starting...)
> 2018-10-04 10:50:08.524782 7fdd516bf700 1 mds.mds1 asok_command: status (complete)
>
> ^^ one of our monitoring health checks performing a "ceph daemon mds.mds1 version", business as usual.
>
> 2018-10-04 10:52:36.712525 7fdd51ec0700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2018-10-04 10:52:36.747577 7fdd4deb8700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2018-10-04 10:52:36.747584 7fdd4deb8700 1 mds.beacon.mds1 _send skipping beacon, heartbeat map not healthy
>
> ^^ the unresponsive mds1 consumes 100% CPU and keeps on logging the above heartbeat_map messages.
>
> In the meantime ceph-mds2 has transitioned from "standby-replay" to "active":
>
> mon.mon1 [INF] daemon mds.mds2 is now active in filesystem BITED-153874-cephfs as rank 0
>
> Logging:
>
> replays, final replay as standby, reopen log
>
> 2018-10-04 10:52:53.268470 7fdb231d9700 1 mds.0.141 reconnect_done
> 2018-10-04 10:52:53.759844 7fdb231d9700 1 mds.mds2 Updating MDS map to version 143 from mon.3
> 2018-10-04 10:52:53.759859 7fdb231d9700 1 mds.0.141 handle_mds_map i am now mds.0.141
> 2018-10-04 10:52:53.759862 7fdb231d9700 1 mds.0.141 handle_mds_map state change up:reconnect --> up:rejoin
> 2018-10-04 10:52:53.759868 7fdb231d9700 1 mds.0.141 rejoin_start
> 2018-10-04 10:52:53.759970 7fdb231d9700 1 mds.0.141 rejoin_joint_start
> 2018-10-04 10:52:53.760970 7fdb1d1cd700 0 mds.0.cache failed to open ino 0x10000cd95e9 err -5/0
> 2018-10-04 10:52:54.126658 7fdb1d1cd700 1 mds.0.141 rejoin_done
> 2018-10-04 10:52:54.770457 7fdb231d9700 1 mds.mds2 Updating MDS map to version 144 from mon.3
> 2018-10-04 10:52:54.770484 7fdb231d9700 1 mds.0.141 handle_mds_map i am now mds.0.141
> 2018-10-04 10:52:54.770487 7fdb231d9700 1 mds.0.141 handle_mds_map state change up:rejoin --> up:clientreplay
> 2018-10-04 10:52:54.770494 7fdb231d9700 1 mds.0.141 recovery_done -- successful recovery!
> 2018-10-04 10:52:54.770617 7fdb231d9700 1 mds.0.141 clientreplay_start
> 2018-10-04 10:52:54.882995 7fdb1d1cd700 1 mds.0.141 clientreplay_done
> 2018-10-04 10:52:55.778598 7fdb231d9700 1 mds.mds2 Updating MDS map to version 145 from mon.3
> 2018-10-04 10:52:55.778622 7fdb231d9700 1 mds.0.141 handle_mds_map i am now mds.0.141
> 2018-10-04 10:52:55.778628 7fdb231d9700 1 mds.0.141 handle_mds_map state change up:clientreplay --> up:active
> 2018-10-04 10:52:55.778638 7fdb231d9700 1 mds.0.141 active_start
> 2018-10-04 10:52:55.805206 7fdb231d9700 1 mds.0.141 cluster recovered.
>
> And then it _also_ starts to log heartbeat_map messages (and consuming 100% CPU); these messages keep repeating themselves:
>
> 2018-10-04 10:53:41.550793 7fdb241db700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2018-10-04 10:53:42.884018 7fdb201d3700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2018-10-04 10:53:42.884024 7fdb201d3700 1 mds.beacon.mds2 _send skipping beacon, heartbeat map not healthy
>
> At that point in time there is one active MDS according to Ceph, but in reality it is not functioning correctly (not serving clients, at least).
>
> ...
> We stopped both daemons. Restarted one ... recovery ... works for half a minute ... then starts logging heartbeat_map messages. Restart again ... works for a little while ... starts logging heartbeat_map messages again. We restart the mds with debug_mds=20 ... it keeps working fine. The other mds gets restarted and keeps on working. We do a couple of failover tests ... works flawlessly (failover in < 1 second, clients reconnect instantly).
>
> A couple of hours later we hit the same issue. We restarted with debug_mds=20 and debug_journaler=20 on the standby-replay node. Eight hours later (an hour ago) we hit the same issue. We captured ~ 4.7 GB of logging ... I skipped to the end of the log file, just before the "heartbeat_map" messages start:
>
> 2018-10-04 23:23:53.144644 7f415ebf4700 20 mds.0.locker client.17079146 pending pAsLsXsFscr allowed pAsLsXsFscr wanted pFscr
> 2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done
> 2018-10-04 23:23:55.088542 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5021
> 2018-10-04 23:23:59.088602 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5022
> 2018-10-04 23:24:03.088688 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5023
> 2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5024
> 2018-10-04 23:24:11.088867 7f415bbee700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2018-10-04 23:24:11.088871 7f415bbee700 1 mds.beacon.mds2 _send skipping beacon, heartbeat map not healthy
>
> As far as I can see, just normal behaviour.
>
> The big question is: what is happening when the MDS starts logging the heartbeat_map messages?
> Why does it log "heartbeat_map is_healthy", only to log 0.000004 seconds later that it's not healthy?
>
Ah, there's a misunderstanding here; the output isn't terribly clear.

"is_healthy" is the name of a *function* in the source code. The line

    heartbeat_map is_healthy 'MDSRank' had timed out after 15

is telling you that the heartbeat_map's is_healthy function is running, and it finds that "'MDSRank' had timed out after 15 [seconds]". So the MDSRank thread is *not* healthy; it didn't check in for 15 seconds! Therefore the MDS beacon code decides not to send a beacon, because it thinks the MDS might be stuck.

From what you've described here, it's most likely that the MDS is trying to read something out of RADOS which is taking a long time, and which we didn't expect to cause a slowdown. You can check via the admin socket to see if there are outstanding Objecter requests or ops_in_flight to get a clue (see the example commands after the quoted text below).
-Greg

> Ceph version: 12.2.8 on all nodes (mon, osd, mds)
> mds: one active / one standby-replay
>
> The system was not under any kind of resource pressure: plenty of CPU and RAM available. Metrics all look normal up to the moment things go into a deadlock (so it seems).
>
> Thanks for any hints / clues.
>
> Gr. Stefan
>
> --
> | BIT BV  http://www.bit.nl/    Kamer van Koophandel 09090351
> | GPG: 0xD14839C6               +31 318 648 688 / i...@bit.nl
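[Editor's note: as an illustration of the admin socket check suggested above, a minimal sketch only; it assumes the active daemon is named mds.mds2 and that the commands are run on the host where that daemon lives. Adjust the daemon name to your deployment.]

    # Outstanding RADOS operations issued by the MDS itself (Objecter requests).
    # Old entries that persist here point at slow reads/writes against the metadata pool.
    ceph daemon mds.mds2 objecter_requests

    # Client requests currently in flight at the MDS (the op tracker).
    ceph daemon mds.mds2 dump_ops_in_flight

If objecter_requests keeps showing the same operations across repeated runs while the heartbeat_map messages are being logged, the MDS is most likely stuck waiting on RADOS rather than spinning on its own work.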