On Thu, Aug 30, 2018 at 12:46 PM William Lawton <william.law...@irdeto.com>
wrote:

> Oh, I see. We'd taken steps to reduce the risk of losing the active MDS and
> mon leader instances at the same time, in the hope that that would prevent
> this issue. Do you know if the MDS always connects to a specific mon
> instance, i.e. a designated mon provider, and can it be determined which
> mon instance that is? Or is it ad hoc?
>
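One generic way to see which monitor a daemon is currently attached to (a
sketch, not a Ceph-specific tool; it assumes the mons listen on the default
port 6789, as in the logs below) is to list the MDS process's TCP sessions
on the MDS host:

    sudo ss -tnp | grep ceph-mds | grep ':6789'

The peer address in that output is the monitor the MDS currently holds its
session with.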



On Thu, Aug 30, 2018 at 9:45 AM Gregory Farnum <gfar...@redhat.com> wrote:

> If you need to co-locate, one low-effort improvement is to have the MDS
> connect to one of the monitors on a different host. You can do that by
> restricting the list of monitors you feed it in ceph.conf, although that's
> no guarantee it won't end up connecting to its own monitor after failures
> or reconnects following first startup.
>

:)


> On 30 Aug 2018, at 20:01, Gregory Farnum <gfar...@redhat.com> wrote:
>
> Okay, that will be the same reason then. If the active MDS is connected to
> a monitor and they fail at the same time, the monitors can't replace the
> MDS until they've been through their own election plus a full MDS timeout
> window.
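>
> Back-of-envelope, assuming stock settings (a sketch: the election time is
> read off the logs below, and mds_beacon_grace defaults to 15 s):
>
>     mon failure detection + election   ~7 s  (03:30:02 -> 03:30:09 below)
>     + mds_beacon_grace                 15 s  (beacon silence required)
>     = ~22 s minimum before the standby can be promoted
>
> which accounts for the bulk of the ~30 s window being reported.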
>
>
> On Thu, Aug 30, 2018 at 11:46 AM William Lawton <william.law...@irdeto.com>
> wrote:
>
>> Thanks for the response, Greg. We did originally have co-located MDS and
>> mon daemons, but realised this wasn't a good idea early on and separated
>> them onto different hosts. So our MDS daemons are on ceph-01 and ceph-02,
>> and our mons are on ceph-03, 04 and 05. Unfortunately we still see this
>> issue when we reboot ceph-02 (mds) and ceph-04 (mon) together. We expect
>> ceph-01 to become the active MDS, but often it doesn't.
>>
>> On 30 Aug 2018, at 17:46, Gregory Farnum <gfar...@redhat.com> wrote:
>>
>> Yes, this is a consequence of co-locating the MDS and monitors — if the
>> MDS reports to its co-located monitor and both fail, the monitor cluster
>> has to go through its own failure detection and then wait a full MDS
>> timeout period before it marks the MDS down. :(
>>
>> We might conceivably be able to optimize for this, but there's no general
>> solution. If you need to co-locate, one low-effort improvement is to have
>> the MDS connect to one of the monitors on a different host. You can do
>> that by restricting the list of monitors you feed it in ceph.conf,
>> although that's no guarantee it won't end up connecting to its own
>> monitor after failures or reconnects following first startup.
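>>
>> For illustration only (an untested sketch with hypothetical addresses):
>> if mds.a runs on the same host as mon.a, the MDS's ceph.conf section can
>> list only the other monitors so its first session lands elsewhere:
>>
>>     [global]
>>     ; full monitor list, used by everything else
>>     mon_host = mon-a-addr,mon-b-addr,mon-c-addr
>>
>>     [mds]
>>     ; feed the MDS only the monitors on other hosts
>>     mon_host = mon-b-addr,mon-c-addr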
>> -Greg
>>
>> On Thu, Aug 30, 2018 at 8:38 AM William Lawton <william.law...@irdeto.com>
>> wrote:
>>
>>> Hi.
>>>
>>>
>>>
>>> We have a 5 node Ceph cluster (see the ceph -s output at the bottom of
>>> this email). During resiliency tests we hit an occasional problem when
>>> we reboot the active MDS instance and a MON instance together, i.e.
>>> dub-sitv-ceph-02 and dub-sitv-ceph-04. We expect the MDS to fail over to
>>> the standby instance dub-sitv-ceph-01, which is in standby-replay mode,
>>> and 80% of the time it does so with no problems. However, 20% of the
>>> time it doesn't, and the MDS_ALL_DOWN health check is not cleared until
>>> 30 seconds later when the rebooted dub-sitv-ceph-02 and dub-sitv-ceph-04
>>> instances come back up.
>>>
>>>
>>>
>>> When the MDS successfully fails over to the standby, we see the
>>> following in ceph.log:
>>>
>>>
>>>
>>> 2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 50 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
>>>
>>> 2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 52 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
>>>
>>> 2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 54 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
>>>
>>>
>>>
>>> When the active MDS role does not fail over to the standby, the
>>> MDS_ALL_DOWN check is not cleared until after the rebooted instances
>>> have come back up, e.g.:
>>>
>>>
>>>
>>> 2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
>>>
>>> 2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
>>>
>>> 2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
>>>
>>> 2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
>>>
>>> 2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
>>>
>>> 2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
>>>
>>> 2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 115 pgs peering (PG_AVAILABILITY)
>>>
>>> 2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
>>>
>>> 2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 69 pgs peering (PG_AVAILABILITY)
>>>
>>> 2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 69 pgs peering)
>>>
>>> 2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
>>>
>>> 2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
>>>
>>> 2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
>>>
>>> 2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks not synchronized
>>>
>>> 2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
>>>
>>> 2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
>>>
>>> 2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
>>>
>>> 2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05)
>>>
>>> 2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 : cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
>>>
>>> 2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded
>>>
>>> 2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 : cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
>>>
>>> 2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 85 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
>>>
>>> 2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
>>>
>>> 2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 88 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
>>>
>>> 2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
>>>
>>>
>>>
>>> In the MDS log we've noticed that when the issue occurs, at precisely
>>> the time the active MDS/MON nodes are rebooted, the standby MDS instance
>>> briefly stops logging "replay_done (as standby)". This is shown in the
>>> log excerpt below, where there is a 9 s gap in these messages (a
>>> one-liner for spotting such gaps is sketched after the excerpt).
>>>
>>>
>>>
>>> 2018-08-25 03:30:00.085 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
>>>
>>> 2018-08-25 03:30:01.091 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
>>>
>>> 2018-08-25 03:30:10.332 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
>>>
>>> 2018-08-25 03:30:11.333 7f3abb303700  1 mds.0.0 replay_done (as standby)
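>>>
>>> A generic way to flag such gaps (a sketch; the log path is the stock
>>> Ceph location and may differ on your hosts):
>>>
>>>     awk '/replay_done \(as standby\)/ {
>>>         split($2, t, ":"); s = t[1]*3600 + t[2]*60 + t[3]
>>>         if (prev != "" && s - prev > 5) print "gap of " (s - prev) "s before: " $0
>>>         prev = s
>>>     }' /var/log/ceph/ceph-mds.dub-sitv-ceph-01.log
>>>
>>> This prints any "replay_done" message that follows its predecessor by
>>> more than 5 seconds.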
>>>
>>>
>>>
>>> I've tried to reproduce the issue by repeatedly rebooting each MDS
>>> instance in turn, five minutes apart, but so far haven't been able to,
>>> so my assumption is that rebooting the MDS and a MON instance at the
>>> same time is a significant factor. (A sketch of driving such a
>>> simultaneous reboot follows.)
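>>>
>>> For test environments only (an obviously disruptive sketch, using the
>>> host names from our cluster):
>>>
>>>     for h in dub-sitv-ceph-02 dub-sitv-ceph-04; do
>>>         ssh "$h" sudo reboot &
>>>     done
>>>     wait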
>>>
>>>
>>>
>>> Our mds_standby* configuration is set as follows:
>>>
>>>
>>>
>>>     "mon_force_standby_active": "true",
>>>
>>>     "mds_standby_for_fscid": "-1",
>>>
>>>     "mds_standby_for_name": "",
>>>
>>>     "mds_standby_for_rank": "0",
>>>
>>>     "mds_standby_replay": "true",
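>>>
>>> (These runtime values can be confirmed on the MDS host via the admin
>>> socket, e.g.
>>>
>>>     ceph daemon mds.dub-sitv-ceph-01 config show | grep standby
>>>
>>> assuming the default admin socket setup.)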
>>>
>>>
>>>
>>> The cluster status is as follows:
>>>
>>>
>>>
>>> cluster:
>>>
>>>     id:     f774b9b2-d514-40d9-85ab-d0389724b6c0
>>>
>>>     health: HEALTH_OK
>>>
>>>
>>>
>>>   services:
>>>
>>>     mon: 3 daemons, quorum
>>> dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05
>>>
>>>     mgr: dub-sitv-ceph-04(active), standbys: dub-sitv-ceph-03,
>>> dub-sitv-ceph-05
>>>
>>>     mds: cephfs-1/1/1 up  {0=dub-sitv-ceph-02=up:active}, 1
>>> up:standby-replay
>>>
>>>     osd: 4 osds: 4 up, 4 in
>>>
>>>
>>>
>>>   data:
>>>
>>>     pools:   2 pools, 200 pgs
>>>
>>>     objects: 554  objects, 980 MiB
>>>
>>>     usage:   7.9 GiB used, 1.9 TiB / 2.0 TiB avail
>>>
>>>     pgs:     200 active+clean
>>>
>>>
>>>
>>>   io:
>>>
>>>     client:   1.5 MiB/s rd, 810 KiB/s wr, 286 op/s rd, 218 op/s wr
>>>
>>>
>>>
>>> Hope someone can help!
>>>
>>> *William Lawton*
>>>
>>>