Hi,

On 12/15/2015 10:22 PM, Gregory Farnum wrote:
> On Tue, Dec 15, 2015 at 10:21 AM, Burkhard Linke
> <burkhard.li...@computational.bio.uni-giessen.de> wrote:
>> Hi,
>>
>> I have a setup with two MDS in active/standby configuration. During times of
>> high network load / network congestion, the active MDS is bounced between
>> both instances:
>>
>> 1. mons(?) decide that MDS A is crashed/not available due to missing
>> heartbeats
>>
>> 2015-12-15 16:38:08.471608 7f880df10700  1 mds.beacon.ceph-storage-01 _send skipping beacon, heartbeat map not healthy
>> 2015-12-15 16:38:10.534941 7f8813e4b700  1 heartbeat_map is_healthy 'MDS' had timed out after 15
>> ...
>> 2015-12-15 16:38:15.468190 7f880e711700  1 heartbeat_map reset_timeout 'MDS' had timed out after 15
>> 2015-12-15 16:38:17.734172 7f8811818700  1 mds.-1.-1 handle_mds_map i (192.168.6.129:6825/2846) dne in the mdsmap, respawning myself
>>
>> 2. Failover to standby MDS B
>> 3. MDS B starts recover/rejoin (takes up to 15 minutes), introducing even
>> more load
>> 4. MDS A is respawned as new standby MDS
>> 5. mons kick out MDS B after timeout
>> 6. Failover to MDS A
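
To check how often an MDS goes through this cycle, the heartbeat timeouts and respawns in its log can be counted. The following is only a rough sketch in Python: the regular expressions are derived from the four log lines quoted above (with the wrapped lines joined), and the names TIMEOUT_RE, RESPAWN_RE and summarize are made up for illustration. Other Ceph versions may word these messages differently.

```python
import re

# Matches lines such as:
#   <timestamp> <thread> 1 heartbeat_map is_healthy 'MDS' had timed out after 15
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"heartbeat_map \w+ '(?P<name>[^']+)' had timed out after (?P<secs>\d+)"
)
# Matches the message the MDS logs when it is no longer in the mdsmap.
RESPAWN_RE = re.compile(r"dne in the mdsmap, respawning myself")


def summarize(lines):
    """Return (list of (timestamp, name, seconds) timeouts, respawn count)."""
    timeouts = []
    respawns = 0
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts.append((match["ts"], match["name"], int(match["secs"])))
        elif RESPAWN_RE.search(line):
            respawns += 1
    return timeouts, respawns


# The log lines from the message above, with the mail wrapping undone.
sample = [
    "2015-12-15 16:38:08.471608 7f880df10700  1 mds.beacon.ceph-storage-01 "
    "_send skipping beacon, heartbeat map not healthy",
    "2015-12-15 16:38:10.534941 7f8813e4b700  1 heartbeat_map is_healthy "
    "'MDS' had timed out after 15",
    "2015-12-15 16:38:15.468190 7f880e711700  1 heartbeat_map reset_timeout "
    "'MDS' had timed out after 15",
    "2015-12-15 16:38:17.734172 7f8811818700  1 mds.-1.-1 handle_mds_map i "
    "(192.168.6.129:6825/2846) dne in the mdsmap, respawning myself",
]

timeouts, respawns = summarize(sample)
print(len(timeouts), respawns)
```

For a real log, the lines of the MDS log file would be passed to summarize() instead of the sample list.
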
> It takes 15 minutes to work through rejoin on your MDS? :/ You might
> try running your daemons in standby-replay instead of just standby, so
> that they have a warm cache.
>
> You could also try to figure out if the limiting factor is MDS
> throughput or OSD IOPs.

The secondary MDS is configured for standby-replay. The OSDs and the network are probably the limiting factors.

I'll try increasing mds_beacon_grace and the session timeout as John suggested.
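
For reference, a rough model of why a larger grace should help, as far as I understand the mechanism: the MDS is given up on once no beacon has been seen for longer than mds_beacon_grace seconds, which appears to be the 15 in the "timed out after 15" lines. The sketch below is Python, not Ceph code. The function name, the beacon arrival times and the grace value of 60 are assumptions chosen only for illustration.

```python
def first_grace_violation(beacon_times, grace):
    """Return the time at which a monitor-like check would give up on the MDS.

    beacon_times: sorted arrival times (in seconds) of beacons at the monitor.
    grace: longest tolerated silence in seconds (stands in for mds_beacon_grace).
    Returns None if no gap between consecutive beacons exceeds the grace.
    """
    for previous, current in zip(beacon_times, beacon_times[1:]):
        if current - previous > grace:
            # The check fails as soon as the silence exceeds the grace period.
            return previous + grace
    return None


# Hypothetical arrival times: a beacon every 4 s, then a 22 s stall under load.
beacons = [0, 4, 8, 12, 34, 38]

print(first_grace_violation(beacons, grace=15))  # 27
print(first_grace_violation(beacons, grace=60))  # None
```

With a grace of 15 the 22 second stall is enough to trigger a failover, while a grace of 60 rides it out. The price is that a genuinely dead MDS would also take longer to be replaced.
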

Thank you for helping,
Burkhard

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
