Re: [ceph-users] MDS does not always failover to hot standby on reboot

William Lawton Thu, 30 Aug 2018 12:46:51 -0700

Oh, I see. We’d taken steps to reduce the risk of losing the active MDS and mon 
leader instances at the same time, in the hope that it would prevent this issue. 
Do you know if the MDS always connects to a specific mon instance, i.e. the mon 
provider, and can it be determined which mon instance that is? Or is it ad hoc?


Sent from my iPhone

On 30 Aug 2018, at 20:01, Gregory Farnum <[email protected]> wrote:

Okay, well that will be the same reason then. If the active MDS is connected 
to a monitor and they fail at the same time, the monitors can’t replace the MDS 
until they’ve been through their own election and a full MDS timeout window.
On Thu, Aug 30, 2018 at 11:46 AM William Lawton <[email protected]> wrote:
Thanks for the response, Greg. We did originally have co-located MDS and mon but 
realised this wasn't a good idea early on and separated them out onto different 
hosts. So our MDS hosts are ceph-01 and ceph-02, and our mon hosts are 
ceph-03, 04 and 05. Unfortunately we see this issue occurring when we reboot 
ceph-02 (MDS) and ceph-04 (mon) together. We expect ceph-01 to become the active 
MDS but often it doesn't.

Sent from my iPhone

On 30 Aug 2018, at 17:46, Gregory Farnum <[email protected]> wrote:

Yes, this is a consequence of co-locating the MDS and monitors — if the MDS 
reports to its co-located monitor and both fail, the monitor cluster has to go 
through its own failure detection and then wait for a full MDS timeout period 
after that before it marks the MDS down. :(

We might conceivably be able to optimize for this, but there's not a general 
solution. If you need to co-locate, one thing that would make it better without 
being a lot of work is trying to have the MDS connect to one of the monitors on 
a different host. You can do that by just restricting the list of monitors you 
feed it in the ceph.conf, although it's not a guarantee that will *prevent* it 
from connecting to its own monitor if there are failures or reconnects after 
first startup.
-Greg
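
As a rough illustration of restricting the monitor list, a ceph.conf fragment 
on the MDS host might look like the one below. This is an untested sketch under 
assumptions: it takes "the list of monitors" to mean the mon_host option, it 
uses a per-daemon section named after the MDS that appears in the logs later in 
this thread, and it lists only the addresses logged there for dub-sitv-ceph-03 
and dub-sitv-ceph-05, leaving out the monitor that is rebooted together with 
the MDS. As noted above, this only influences the first connection and does not 
guarantee which monitor the MDS ends up talking to after a reconnect.

```ini
# Hypothetical fragment for the ceph.conf on the MDS host (not from the thread).
[mds.dub-sitv-ceph-02]
# Only the monitors that are not rebooted at the same time as this MDS.
mon_host = 10.18.53.32,10.18.186.208
```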

On Thu, Aug 30, 2018 at 8:38 AM William Lawton <[email protected]> wrote:
Hi.

We have a 5 node Ceph cluster (refer to ceph -s output at bottom of email). 
During resiliency tests we have an occasional problem when we reboot the active 
MDS instance and a MON instance together, i.e. dub-sitv-ceph-02 and 
dub-sitv-ceph-04. We expect the MDS to fail over to the standby instance 
dub-sitv-ceph-01, which is in standby-replay mode, and 80% of the time it does 
with no problems. However, 20% of the time it doesn’t, and the MDS_ALL_DOWN 
health check is not cleared until 30 seconds later, when the rebooted 
dub-sitv-ceph-02 and dub-sitv-ceph-04 instances come back up.

When the MDS successfully fails over to the standby we see in the ceph.log the 
following:

2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 50 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 52 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 54 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)

When the active MDS role does not fail over to the standby, the MDS_ALL_DOWN 
check is not cleared until after the rebooted instances have come back up, e.g.:

2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 115 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 69 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 69 pgs peering)
2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks not synchronized
2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05)
2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 : cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded
2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 : cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 85 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 88 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
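
To put a number on how long the filesystem stays offline in each case, the 
interval between the first "Health check failed ... (MDS_ALL_DOWN)" entry and 
the following "Health check cleared: MDS_ALL_DOWN" entry can be measured with a 
small script. The Python sketch below is only an illustration: the function 
name mds_all_down_windows is made up, and it assumes each ceph.log entry sits 
on a single line, as it does in the file on disk. The sample entries are the 
three MDS_ALL_DOWN lines from the failing run.

```python
import re
from datetime import datetime

# Leading timestamp of a ceph.log cluster entry, e.g. 2018-08-25 03:30:02.936554
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")


def mds_all_down_windows(lines):
    """Return (start, end, seconds) for each MDS_ALL_DOWN outage.

    An outage opens at the first 'Health check failed ... (MDS_ALL_DOWN)'
    entry and closes at the next 'Health check cleared: MDS_ALL_DOWN' entry.
    Repeated 'failed' entries inside an open outage are ignored.
    """
    windows = []
    start = None
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # not a timestamped entry (assumes one entry per line)
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if "Health check failed" in line and "(MDS_ALL_DOWN)" in line:
            if start is None:
                start = ts
        elif "Health check cleared: MDS_ALL_DOWN" in line and start is not None:
            windows.append((start, ts, (ts - start).total_seconds()))
            start = None
    return windows


PREFIX = " mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 "
sample = [
    "2018-08-25 03:30:02.936554" + PREFIX + "55 : cluster [ERR] "
    "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)",
    "2018-08-25 03:30:35.270267" + PREFIX + "86 : cluster [ERR] "
    "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)",
    "2018-08-25 03:30:35.282268" + PREFIX + "89 : cluster [INF] "
    "Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)",
]

for start, end, seconds in mds_all_down_windows(sample):
    print(f"MDS_ALL_DOWN from {start} to {end}: {seconds:.3f}s")
```

On the failing run this gives an outage of a little over 32 seconds, in line 
with the roughly 30 seconds described above; on the successful run the same 
calculation gives a few milliseconds.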

In the MDS log we’ve noticed that when the issue occurs, at precisely the time 
when the active MDS/MON nodes are rebooted, the standby MDS instance briefly 
stops logging replay_done (as standby). This is shown in the log excerpt below, 
where there is a 9s gap in these logs.

2018-08-25 03:30:00.085 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
2018-08-25 03:30:01.091 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
2018-08-25 03:30:10.332 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
2018-08-25 03:30:11.333 7f3abb303700  1 mds.0.0 replay_done (as standby)
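
Gaps like this one can be picked out of a longer MDS log automatically. The 
Python sketch below is an illustration only: the function name 
replay_done_gaps and the 2 second threshold are assumptions (the entries above 
arrive roughly once per second), and it relies on the timestamp occupying the 
first 23 characters of each line, as in the excerpt.

```python
from datetime import datetime


def replay_done_gaps(lines, threshold_s=2.0):
    """Return (previous, current, gap_seconds) for every pair of consecutive
    'replay_done (as standby)' entries more than threshold_s seconds apart."""
    gaps = []
    prev = None
    for line in lines:
        if "replay_done (as standby)" not in line:
            continue
        # Timestamp format in the excerpt: 'YYYY-MM-DD HH:MM:SS.mmm' (23 chars)
        ts = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_s:
                gaps.append((prev, ts, gap))
        prev = ts
    return gaps


sample = [
    "2018-08-25 03:30:00.085 7f3ab9b00700  1 mds.0.0 replay_done (as standby)",
    "2018-08-25 03:30:01.091 7f3ab9b00700  1 mds.0.0 replay_done (as standby)",
    "2018-08-25 03:30:10.332 7f3ab9b00700  1 mds.0.0 replay_done (as standby)",
    "2018-08-25 03:30:11.333 7f3abb303700  1 mds.0.0 replay_done (as standby)",
]

for prev, cur, gap in replay_done_gaps(sample):
    print(f"no replay_done between {prev.time()} and {cur.time()} ({gap:.3f}s)")
```

For the four entries above this reports a single gap, between 03:30:01.091 and 
03:30:10.332.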

I’ve tried to reproduce the issue by rebooting each MDS instance in turn 
repeatedly 5 minutes apart but so far haven’t been able to do so, so my 
assumption is that rebooting the MDS and a MON instance at the same time is a 
significant factor.

Our mds_standby* configuration is set as follows:

    "mon_force_standby_active": "true",
    "mds_standby_for_fscid": "-1",
    "mds_standby_for_name": "",
    "mds_standby_for_rank": "0",
    "mds_standby_replay": "true",

The cluster status is as follows:

  cluster:
    id:     f774b9b2-d514-40d9-85ab-d0389724b6c0
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05
    mgr: dub-sitv-ceph-04(active), standbys: dub-sitv-ceph-03, dub-sitv-ceph-05
    mds: cephfs-1/1/1 up  {0=dub-sitv-ceph-02=up:active}, 1 up:standby-replay
    osd: 4 osds: 4 up, 4 in

  data:
    pools:   2 pools, 200 pgs
    objects: 554  objects, 980 MiB
    usage:   7.9 GiB used, 1.9 TiB / 2.0 TiB avail
    pgs:     200 active+clean

  io:
    client:   1.5 MiB/s rd, 810 KiB/s wr, 286 op/s rd, 218 op/s wr

Hope someone can help!
William Lawton


_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
