Yes, this is a consequence of co-locating the MDS and monitors: if the MDS reports to its co-located monitor and both fail, the monitor cluster first has to go through its own failure detection, and then wait out a full MDS timeout on top of that, before it marks the MDS down. :(
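(For reference, the MDS timeout in question is the mds beacon grace; if I'm remembering the defaults right it adds roughly another 15 seconds on top of the monitor election. A minimal ceph.conf sketch of the knobs involved, using what I believe are the upstream defaults; I haven't tested how far down they can safely be turned:)

    [global]
        # how often the MDS sends a beacon to the monitors (default 4s, as far as I recall)
        mds beacon interval = 4
        # how long the monitors will go without a beacon before marking the MDS
        # laggy/failed (default 15s, as far as I recall)
        mds beacon grace = 15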
We might conceivably be able to optimize for this, but there's not a general solution. If you need to co-locate, one thing that would make it better without being a lot of work is trying to have the MDS connect to one of the monitors on a different host. You can do that by just restricting the list of monitors you feed it in the ceph.conf (there's a rough sketch of what I mean at the bottom of this mail), although it's not a guarantee that will *prevent* it from connecting to its own monitor if there are failures or reconnects after first startup.
-Greg

On Thu, Aug 30, 2018 at 8:38 AM William Lawton <william.law...@irdeto.com> wrote:
>
> Hi.
>
> We have a 5 node Ceph cluster (refer to ceph -s output at bottom of email). During resiliency tests we have an occasional problem when we reboot the active MDS instance and a MON instance together, i.e. dub-sitv-ceph-02 and dub-sitv-ceph-04. We expect the MDS to fail over to the standby instance dub-sitv-ceph-01, which is in standby-replay mode, and 80% of the time it does with no problems. However, 20% of the time it doesn’t, and the MDS_ALL_DOWN health check is not cleared until 30 seconds later, when the rebooted dub-sitv-ceph-02 and dub-sitv-ceph-04 instances come back up.
>
> When the MDS successfully fails over to the standby we see the following in the ceph.log:
>
> 2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 50 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
> 2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 52 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
> 2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 54 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
>
> When the active MDS role does not fail over to the standby, the MDS_ALL_DOWN check is not cleared until after the rebooted instances have come back up, e.g.:
>
> 2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
> 2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
> 2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
> 2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
> 2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
> 2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
> 2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 115 pgs peering (PG_AVAILABILITY)
> 2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 69 pgs peering (PG_AVAILABILITY)
> 2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 69 pgs peering)
> 2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
> 2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks not synchronized
> 2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
> 2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
> 2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
> 2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05)
> 2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 : cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
> 2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded
> 2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 : cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
> 2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 85 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
> 2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
> 2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 88 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
> 2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
>
> In the MDS log we’ve noticed that when the issue occurs, at precisely the time when the active MDS/MON nodes are rebooted, the standby MDS instance briefly stops logging replay_done (as standby). This is shown in the log excerpt below, where there is a 9s gap in these logs.
>
> 2018-08-25 03:30:00.085 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
> 2018-08-25 03:30:01.091 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
> 2018-08-25 03:30:10.332 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
> 2018-08-25 03:30:11.333 7f3abb303700  1 mds.0.0 replay_done (as standby)
>
> I’ve tried to reproduce the issue by rebooting each MDS instance in turn repeatedly 5 minutes apart but so far haven’t been able to do so, so my assumption is that rebooting the MDS and a MON instance at the same time is a significant factor.
>
> Our mds_standby* configuration is set as follows:
>
>     "mon_force_standby_active": "true",
>     "mds_standby_for_fscid": "-1",
>     "mds_standby_for_name": "",
>     "mds_standby_for_rank": "0",
>     "mds_standby_replay": "true",
>
> The cluster status is as follows:
>
>   cluster:
>     id:     f774b9b2-d514-40d9-85ab-d0389724b6c0
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05
>     mgr: dub-sitv-ceph-04(active), standbys: dub-sitv-ceph-03, dub-sitv-ceph-05
>     mds: cephfs-1/1/1 up {0=dub-sitv-ceph-02=up:active}, 1 up:standby-replay
>     osd: 4 osds: 4 up, 4 in
>
>   data:
>     pools:   2 pools, 200 pgs
>     objects: 554 objects, 980 MiB
>     usage:   7.9 GiB used, 1.9 TiB / 2.0 TiB avail
>     pgs:     200 active+clean
>
>   io:
>     client: 1.5 MiB/s rd, 810 KiB/s wr, 286 op/s rd, 218 op/s wr
>
> Hope someone can help!
>
> William Lawton
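P.S. Here is roughly the kind of ceph.conf restriction I meant, as a sketch only. The addresses are the ones that appear in your log excerpts, so double-check them, and note this only affects which monitor the MDS picks when it starts up, not where it may end up after later reconnects:

    # ceph.conf on the MDS hosts (dub-sitv-ceph-01 and dub-sitv-ceph-02) only;
    # if other daemons or clients on those hosts share this file, keep that in mind.
    # List only the monitors that are not rebooted together with the MDS in your
    # test, i.e. leave out dub-sitv-ceph-04 (10.18.53.155).
    [global]
        mon host = 10.18.53.32:6789, 10.18.186.208:6789   # dub-sitv-ceph-03, dub-sitv-ceph-05

You can check from the monitor side which mon the MDS session actually ended up on via the admin socket, e.g.:

    ceph daemon mon.dub-sitv-ceph-03 sessions

(the output format differs a bit between versions, but the mds entries show up with their addresses).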
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com