Hi Reed,
Thank you so much for the input and support. We have tried the setting you
suggested, but could not see any impact on the current system:
*"ceph fs set cephfs allow_standby_replay true"* made no measurable
*difference in the failover time*.
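
As a sanity check on our side, these are the commands we are using to
confirm the flag actually took effect and that one of the daemons really
went into standby-replay for rank 0 (a minimal sketch; "cephfs" is simply
our filesystem name):

    # dump the filesystem settings; the flags line should now list
    # allow_standby_replay
    ceph fs get cephfs

    # the status output should show one daemon in the standby-replay
    # state, bound to rank 0, rather than a plain standby
    ceph fs status cephfs

Our understanding (please correct us if wrong) is that standby-replay only
shortens the up:replay phase, and only if a daemon has actually switched
to standby-replay; if none did, the flag alone would change nothing, which
may explain why we saw no difference.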

Furthermore, we have tried a few more scenarios in our test setup:
*Scenario 1:*
[image: image.png]

   - Here we looked at the logs on the node that the MDS fails over to,
   i.e. in this case, if we reboot cephnode2, the new active MDS will be
   cephnode1. We checked the cephnode1 logs in two cases:
   - 1. *Normal reboot of cephnode2 with the I/O operation kept in
   progress:*
      - The log on cephnode1 starts up immediately, then waits for some
      time (around 15 seconds, apparently the beacon grace) plus an
      additional 6-7 seconds during which the MDS on cephnode1 becomes
      active and I/O resumes. Refer to the logs below:
      2021-04-29T15:49:42.480+0530 7fa747690700  1 mds.cephnode1 Updating MDS map to version 505 from mon.2
      2021-04-29T15:49:42.482+0530 7fa747690700  1 mds.0.505 handle_mds_map i am now mds.0.505
      2021-04-29T15:49:42.482+0530 7fa747690700  1 mds.0.505 handle_mds_map state change up:boot --> up:replay
      2021-04-29T15:49:42.482+0530 7fa747690700  1 mds.0.505 replay_start
      2021-04-29T15:49:42.482+0530 7fa747690700  1 mds.0.505  recovery set is
      2021-04-29T15:49:42.482+0530 7fa747690700  1 mds.0.505  waiting for osdmap 486 (which blacklists prior instance)
      2021-04-29T15:49:55.686+0530 7fa74568c700  1 mds.beacon.cephnode1 MDS connection to Monitors appears to be laggy; 15.9769s since last acked beacon
      2021-04-29T15:49:55.686+0530 7fa74568c700  1 mds.0.505 skipping upkeep work because connection to Monitors appears laggy
      2021-04-29T15:49:57.533+0530 7fa749e95700  0 mds.beacon.cephnode1 MDS is no longer laggy
      2021-04-29T15:49:59.599+0530 7fa740e83700  0 mds.0.cache creating system inode with ino:0x100
      2021-04-29T15:49:59.599+0530 7fa740e83700  0 mds.0.cache creating system inode with ino:0x1
      2021-04-29T15:50:00.456+0530 7fa73f680700  1 mds.0.505 Finished replaying journal
      2021-04-29T15:50:00.456+0530 7fa73f680700  1 mds.0.505 making mds journal writeable
      2021-04-29T15:50:00.959+0530 7fa747690700  1 mds.cephnode1 Updating MDS map to version 506 from mon.2
      2021-04-29T15:50:00.959+0530 7fa747690700  1 mds.0.505 handle_mds_map i am now mds.0.505
      2021-04-29T15:50:00.959+0530 7fa747690700  1 mds.0.505 handle_mds_map state change up:replay --> up:reconnect
      2021-04-29T15:50:00.959+0530 7fa747690700  1 mds.0.505 reconnect_start
      2021-04-29T15:50:00.959+0530 7fa747690700  1 mds.0.505 reopen_log
      2021-04-29T15:50:00.959+0530 7fa747690700  1 mds.0.server reconnect_clients -- 2 sessions
      2021-04-29T15:50:00.964+0530 7fa747690700  0 log_channel(cluster) log [DBG] : reconnect by client.6892 v1:10.0.4.96:0/1646469259 after 0.00499997
      2021-04-29T15:50:00.972+0530 7fa747690700  0 log_channel(cluster) log [DBG] : reconnect by client.6990 v1:10.0.4.115:0/2776266880 after 0.0129999
      2021-04-29T15:50:00.972+0530 7fa747690700  1 mds.0.505 reconnect_done
      2021-04-29T15:50:02.005+0530 7fa747690700  1 mds.cephnode1 Updating MDS map to version 507 from mon.2
      2021-04-29T15:50:02.005+0530 7fa747690700  1 mds.0.505 handle_mds_map i am now mds.0.505
      2021-04-29T15:50:02.005+0530 7fa747690700  1 mds.0.505 handle_mds_map state change up:reconnect --> up:rejoin
      2021-04-29T15:50:02.005+0530 7fa747690700  1 mds.0.505 rejoin_start
      2021-04-29T15:50:02.008+0530 7fa747690700  1 mds.0.505 rejoin_joint_start
      2021-04-29T15:50:02.040+0530 7fa740e83700  1 mds.0.505 rejoin_done
      2021-04-29T15:50:03.050+0530 7fa747690700  1 mds.cephnode1 Updating MDS map to version 508 from mon.2
      2021-04-29T15:50:03.050+0530 7fa747690700  1 mds.0.505 handle_mds_map i am now mds.0.505
      2021-04-29T15:50:03.050+0530 7fa747690700  1 mds.0.505 handle_mds_map state change up:rejoin --> up:clientreplay
      2021-04-29T15:50:03.050+0530 7fa747690700  1 mds.0.505 recovery_done -- successful recovery!
      2021-04-29T15:50:03.050+0530 7fa747690700  1 mds.0.505 clientreplay_start
      2021-04-29T15:50:03.094+0530 7fa740e83700  1 mds.0.505 clientreplay_done
      2021-04-29T15:50:04.081+0530 7fa747690700  1 mds.cephnode1 Updating MDS map to version 509 from mon.2
      2021-04-29T15:50:04.081+0530 7fa747690700  1 mds.0.505 handle_mds_map i am now mds.0.505
      2021-04-29T15:50:04.081+0530 7fa747690700  1 mds.0.505 handle_mds_map state change up:clientreplay --> up:active
      2021-04-29T15:50:04.081+0530 7fa747690700  1 mds.0.505 active_start
      2021-04-29T15:50:04.085+0530 7fa747690700  1 mds.0.505 cluster recovered.



   - 2. *Hard reset/power-off of cephnode2 with the I/O operation kept in
   progress:*
   - In this case the logs on cephnode1 (where the new MDS will be
      activated) only start moving 15+ seconds after the power-off.
         - Time at which the power-off was done: 2021-04-29-16-17-37
         - *Time at which the logs started to show on cephnode1* (refer to
         the logs below), i.e. roughly 15 seconds after the hard reset:


      2021-04-29T16:17:51.983+0530 7f5ba3a38700  1 mds.cephnode1 Updating MDS map to version 518 from mon.0
      2021-04-29T16:17:51.984+0530 7f5ba3a38700  1 mds.0.518 handle_mds_map i am now mds.0.518
      2021-04-29T16:17:51.984+0530 7f5ba3a38700  1 mds.0.518 handle_mds_map state change up:boot --> up:replay
      2021-04-29T16:17:51.984+0530 7f5ba3a38700  1 mds.0.518 replay_start
      2021-04-29T16:17:51.984+0530 7f5ba3a38700  1 mds.0.518  recovery set is
      2021-04-29T16:17:51.984+0530 7f5ba3a38700  1 mds.0.518  waiting for osdmap 504 (which blacklists prior instance)
      2021-04-29T16:17:54.044+0530 7f5b9ca2a700  0 mds.0.cache creating system inode with ino:0x100
      2021-04-29T16:17:54.045+0530 7f5b9ca2a700  0 mds.0.cache creating system inode with ino:0x1
      2021-04-29T16:17:55.025+0530 7f5b9ba28700  1 mds.0.518 Finished replaying journal
      2021-04-29T16:17:55.025+0530 7f5b9ba28700  1 mds.0.518 making mds journal writeable
      2021-04-29T16:17:56.060+0530 7f5ba3a38700  1 mds.cephnode1 Updating MDS map to version 519 from mon.0
      2021-04-29T16:17:56.060+0530 7f5ba3a38700  1 mds.0.518 handle_mds_map i am now mds.0.518
      2021-04-29T16:17:56.060+0530 7f5ba3a38700  1 mds.0.518 handle_mds_map state change up:replay --> up:reconnect
      2021-04-29T16:17:56.060+0530 7f5ba3a38700  1 mds.0.518 reconnect_start
      2021-04-29T16:17:56.060+0530 7f5ba3a38700  1 mds.0.518 reopen_log
      2021-04-29T16:17:56.060+0530 7f5ba3a38700  1 mds.0.server reconnect_clients -- 2 sessions
      2021-04-29T16:17:56.068+0530 7f5ba3a38700  0 log_channel(cluster) log [DBG] : reconnect by client.6990 v1:10.0.4.115:0/2776266880 after 0.00799994
      2021-04-29T16:17:56.069+0530 7f5ba3a38700  0 log_channel(cluster) log [DBG] : reconnect by client.6892 v1:10.0.4.96:0/1646469259 after 0.00899994
      2021-04-29T16:17:56.069+0530 7f5ba3a38700  1 mds.0.518 reconnect_done
      2021-04-29T16:17:57.099+0530 7f5ba3a38700  1 mds.cephnode1 Updating MDS map to version 520 from mon.0
      2021-04-29T16:17:57.099+0530 7f5ba3a38700  1 mds.0.518 handle_mds_map i am now mds.0.518
      2021-04-29T16:17:57.099+0530 7f5ba3a38700  1 mds.0.518 handle_mds_map state change up:reconnect --> up:rejoin
      2021-04-29T16:17:57.099+0530 7f5ba3a38700  1 mds.0.518 rejoin_start
      2021-04-29T16:17:57.103+0530 7f5ba3a38700  1 mds.0.518 rejoin_joint_start
      2021-04-29T16:17:57.472+0530 7f5b9d22b700  1 mds.0.518 rejoin_done
      2021-04-29T16:17:58.138+0530 7f5ba3a38700  1 mds.cephnode1 Updating MDS map to version 521 from mon.0
      2021-04-29T16:17:58.138+0530 7f5ba3a38700  1 mds.0.518 handle_mds_map i am now mds.0.518
      2021-04-29T16:17:58.138+0530 7f5ba3a38700  1 mds.0.518 handle_mds_map state change up:rejoin --> up:clientreplay
      2021-04-29T16:17:58.138+0530 7f5ba3a38700  1 mds.0.518 recovery_done -- successful recovery!
      2021-04-29T16:17:58.138+0530 7f5ba3a38700  1 mds.0.518 clientreplay_start
      2021-04-29T16:17:58.157+0530 7f5b9d22b700  1 mds.0.518 clientreplay_done
      2021-04-29T16:17:59.178+0530 7f5ba3a38700  1 mds.cephnode1 Updating MDS map to version 522 from mon.0
      2021-04-29T16:17:59.178+0530 7f5ba3a38700  1 mds.0.518 handle_mds_map i am now mds.0.518
      2021-04-29T16:17:59.178+0530 7f5ba3a38700  1 mds.0.518 handle_mds_map state change up:clientreplay --> up:active
      2021-04-29T16:17:59.178+0530 7f5ba3a38700  1 mds.0.518 active_start
      2021-04-29T16:17:59.181+0530 7f5ba3a38700  1 mds.0.518 cluster recovered.

*In both the test cases above* we saw an extra delay of *around 15
seconds* plus another 8-10 seconds (21-25 seconds in total for failover in
the power-off/reboot cases).
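
The ~15 second part of that lines up with what we believe is the default
MDS beacon grace. To confirm the values actually in effect on our cluster
we are checking the following (both options exist as far as we know, and
the defaults noted are our understanding):

    # interval at which each MDS sends a beacon to the mons (default 4s)
    ceph config get mds mds_beacon_interval

    # how long the mons wait without a beacon before marking the MDS
    # laggy/failed (default 15s, which matches the delay we observed)
    ceph config get mds mds_beacon_grace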

*Query:* Is there any specific config that may need to be tweaked/tried to
reduce the time it takes for the cluster to detect the failure and activate
the standby MDS node?
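
For what it is worth, the knobs we are considering on our side are the
same beacon settings, e.g. something like the following (not yet tested,
so please treat the values as assumptions and correct us if these are the
wrong options to touch):

    # let the mons give up on a silent MDS sooner (default 15s)
    ceph config set global mds_beacon_grace 10

    # optionally beacon more frequently as well (default 4s)
    ceph config set global mds_beacon_interval 2

Our concern is that lowering the grace too far could trigger spurious
failovers whenever a beacon is merely delayed (the "appears to be laggy"
messages in the first log hint at that), so any guidance on safe values
would be very welcome.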

  *Scenario 2:*

   - *Only stop the MDS daemon service on the active node*
      - In this scenario, where we only stopped the systemctl service for
      the MDS on the active node, we get a very good *reading of around 5-7
      seconds* for failover (the steps we used are sketched after the
      table):

      Deployment Mode                 | CEPH MDS Setup     | Test Case                   | I/O Resume Duration (Seconds) | Node affected
      2 Node MDS Setup with max_mds=1 | Active-Standby MDS | Active Node MDS Daemon stop | 5-7                           | cephnode1
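
For completeness, this is roughly how we ran scenario 2 and measured the
I/O resume time (the mount path and the systemd unit name below are just
what our setup uses, so treat them as placeholders):

    # on the client: create one directory per second and print a timestamp,
    # so a gap in the output shows how long I/O was stalled
    while true; do mkdir /mnt/cephfs/dir_$(date +%s) && date; sleep 1; done

    # on the active MDS node (cephnode1 here): stop only the MDS daemon
    sudo systemctl stop ceph-mds@cephnode1

    # watch the standby take over
    ceph fs status cephfs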


Please suggest/advise whether there is anything we can configure to achieve
a minimal failover duration in the first two scenarios as well.

Best Regards,
Lokendra




On Thu, Apr 29, 2021 at 1:47 AM Reed Dier <reed.d...@focusvq.com> wrote:

> I don't have anything of merit to add to this, but it would be an
> interesting addition to your testing to see if active+standby-replay makes
> any difference with test-case1.
>
> I don't think it would be applicable to any of the other use-cases, as a
> standby-replay MDS is bound to a single rank, meaning its bound to a single
> active MDS, and can't function as a standby for active:active.
>
> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
>
>
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/ceph_file_system_guide_technology_preview/installing_and_configuring_ceph_metadata_servers_mds#mds-configuring-standby-daemons-standby-replay
>
> Good luck and look forward to hearing feedback/more results.
>
> Reed
>
> On Apr 27, 2021, at 8:40 AM, Lokendra Rathour <lokendrarath...@gmail.com>
> wrote:
>
> Hi Team,
> We have setup two Node Ceph Cluster using *Native Cephfs Driver* with *Details
> as:*
>
>    - 3 Node / 2 Node MDS Cluster
>    - 3 Node Monitor Quorum
>    - 2 Node OSD
>    - 2 Nodes for Manager
>
>
> Cephnode3 have only Mon and MDS (only for test case 4-7) rest two nodes
> i.e. cephnode1 and cephnode2 have (mgr,mds,mon,rgw)
>
>
> We have tested following failover scenarios for Native Cephfs Driver by
> mounting for any one sub-volume on a VM or client with continuous I/O
> operations(Directory creation after every 1 Second)*:*
>
> <image.png>
>
>
> In the table above we have few queries as:
>
>    - Refer test case 2 and test case 7, both are similar test case with
>    only difference in number of Ceph MDS with time for both the test cases is
>    different. It should be zero. But time is coming as 17 seconds for testcase
>    7.
>    - Is there any configurable parameter/any configuration which we need
>    to make in the Ceph cluster to get the failover time reduced to few
>    seconds?
>
> In current default deployment we are getting something around 35-40
> seconds.
>
>
>
>
>
>
> Best Regards,
>
> --
> ~ Lokendra
> www.inertiaspeaks.com
> www.inertiagroups.com
> skype: lokendrarathour
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>

-- 
~ Lokendra
www.inertiaspeaks.com
www.inertiagroups.com
skype: lokendrarathour
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
