> It's mds_beacon_grace. Set that on the monitor to control the replacement of
> laggy MDS daemons,
Sounds like William's issue is something else. William shuts down MDS 2 and
MON 4 simultaneously. The log shows that some time later (we don't know how
long), MON 3 detects that MDS 2 is gone ("MDS_ALL_DOWN"), but does nothing
about it until 30 seconds later, which happens to be when MDS 2 and MON 4 come
back. At that point, MON 3 reports that the rank has been reassigned to MDS
1.
'mds_beacon_grace' determines when a monitor declares MDS_ALL_DOWN, right?
I think if things are working as designed, the log should show MON 3
reassigning the rank to MDS 1 immediately after it reports MDS 2 is gone.
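For reference, the grace period discussed above can be adjusted on the monitors roughly like this (a sketch, not verified against William's release; the default is 15 seconds, and `ceph config set` only exists from Mimic on, so older clusters would use injectargs or ceph.conf instead):

```shell
# Raise the MDS beacon grace period on the monitors (value in seconds).
# Mimic and later:
ceph config set mon mds_beacon_grace 30

# Pre-Mimic equivalent (runtime only; persist it in ceph.conf as well):
ceph tell mon.* injectargs '--mds_beacon_grace=30'
```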
From the original post:
2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 :
cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 226
: cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 :
cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 :
cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons
dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 :
cluster [WRN] Health check failed: 1/3 mons down, quorum
dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 :
cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons
down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 :
cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive,
115 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 :
cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects
degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 :
cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 69
pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 :
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 1 pg inactive, 69 pgs peering)
2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 :
cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 objects
degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 :
cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 objects
degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 :
cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 :
cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks
not synchronized
2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 :
cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 227
: cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 :
cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons
dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 :
cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum
dub-sitv-ceph-03,dub-sitv-ceph-05)
2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 :
cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 :
cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; Degraded
data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded
2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 :
cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 85 :
cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 :
cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 88 :
cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs
as rank 0
2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 :
cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
--
Bryan Henderson San Jose, California
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com