[ceph-users] MDS HA failover

Luke Weber Wed, 08 Feb 2017 13:47:20 -0800

I'm playing around with mds with a hot standby on kraken. When I fail out the
active mds manually, i.e. ceph mds fail <active-mds>, it switches correctly to
the standby.


I noticed that when I have two mds servers and I shut down the active mds
server, it takes 5 minutes for the standby replay to become active (it seems
to be 20 retries at a 15 second timeout to the previously active mds). I can't
fail the active mds at that point, as it has already been removed from the mds
map, but the hot standby is stuck in replay mode for 5 minutes waiting for the
active before it gives up and becomes active. I'm curious whether there's a
preferred way to configure this behavior or to force a failover in the event
of an unexpected active failure.

*MDS log of standby becoming master:*

2017-02-08 17:25:54.151002 7fa0a1502700  1 mds.0.0 replay_done (as standby)
2017-02-08 17:25:55.153022 7fa0a1502700  1 mds.0.0 replay_done (as standby)
2017-02-08 17:25:56.154928 7fa0a1502700  1 mds.0.0 replay_done (as standby)
2017-02-08 17:25:57.156771 7fa0a1502700  1 mds.0.0 replay_done (as standby)
2017-02-08 17:25:58.158700 7fa0a1502700  1 mds.0.0 replay_done (as standby)
*----- Shutdown active mds (Start to see it reconnecting to active server):*
2017-02-08 17:26:08.774979 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:26:23.775456 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
*----- 15 second grace to get an mds map update (mds beacon grace=15)*
2017-02-08 17:26:25.003332 7fa0a650c700  1 mds.0.132 handle_mds_map i am
now mds.0.132
2017-02-08 17:26:25.003340 7fa0a650c700  1 mds.0.132 handle_mds_map state
change up:standby-replay --> up:replay
2017-02-08 17:26:38.776036 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:26:53.776916 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:27:08.777962 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:27:23.777884 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:27:38.778943 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:27:53.779926 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0b8316800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:28:08.780927 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:28:23.780909 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:28:38.781947 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:28:53.782075 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:29:08.782916 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:29:23.783476 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:29:38.784445 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:29:53.784934 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:30:08.785959 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:30:23.786921 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:30:38.786923 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:30:53.788035 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:31:08.788730 7fa0a9483700  0 -- 172.20.1.139:6800/255206595
>> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
cs=0 l=0).fault with nothing to send and in the half  accept state just
closed
2017-02-08 17:31:15.393349 7fa0a1502700  1 mds.0.132 replay_done (as
standby)
2017-02-08 17:31:15.393353 7fa0a1502700  1 mds.0.132 standby_replay_restart
(final takeover pass)
2017-02-08 17:31:15.397825 7fa0a1502700  1 mds.0.132 replay_done
2017-02-08 17:31:15.397832 7fa0a1502700  1 mds.0.132 making mds journal
writeable
2017-02-08 17:31:16.163297 7fa0a650c700  1 mds.0.132 handle_mds_map i am
now mds.0.132
2017-02-08 17:31:16.163303 7fa0a650c700  1 mds.0.132 handle_mds_map state
change up:replay --> up:reconnect
2017-02-08 17:31:16.163312 7fa0a650c700  1 mds.0.132 reconnect_start
2017-02-08 17:31:16.163314 7fa0a650c700  1 mds.0.132 reopen_log
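
As a rough check of the 5 minute figure, the time the standby spends in
up:replay can be measured from the two "state change" entries in the log
above. This is only a minimal sketch in Python: it assumes the wrapped log
lines have been rejoined so that each entry sits on one line, and the names
LOG, PATTERN and state_changes are made up for this illustration, not part of
any Ceph tooling.

```python
import re
from datetime import datetime

# Two state-change entries copied from the log above, with the mail
# client's line wrapping undone so each entry is on a single line.
LOG = """\
2017-02-08 17:26:25.003340 7fa0a650c700  1 mds.0.132 handle_mds_map state change up:standby-replay --> up:replay
2017-02-08 17:31:16.163303 7fa0a650c700  1 mds.0.132 handle_mds_map state change up:replay --> up:reconnect
"""

# Timestamp at the start of the line, then "state change OLD --> NEW".
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) .* state change (\S+) --> (\S+)$"
)


def state_changes(text):
    """Return (timestamp, old_state, new_state) for each state-change line."""
    changes = []
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            changes.append((ts, m.group(2), m.group(3)))
    return changes


changes = state_changes(LOG)
# Time between entering up:replay and leaving it for up:reconnect.
replay_seconds = (changes[1][0] - changes[0][0]).total_seconds()
print(f"time spent in up:replay: {replay_seconds:.1f} s")
print(f"20 retries x 15 s = {20 * 15} s")
```

For the two entries above this comes out at roughly 291 seconds, which is in
line with the estimate of 20 retries at 15 seconds (300 seconds).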

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
