> -----Original Message-----
> From: Ira Cooper [mailto:icoo...@redhat.com]
> Sent: 09 May 2016 17:31
> To: Sage Weil <s...@newdream.net>
> Cc: Nick Fisk <n...@fisk.me.uk>; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on
> lockfile
> ----- Original Message -----
> > On Mon, 9 May 2016, Nick Fisk wrote:
> > > Hi All,
> > >
> > > I've been testing an active/active Samba cluster over CephFS,
> > > performance seems really good with small files compared to Gluster.
> > > Soft reboots work beautifully with little to no interruption in file
> > > access. However when I perform a hard shutdown/reboot of one of the
> > > samba nodes, the remaining node detects that the other Samba node
> > > has disappeared but then eventually bans itself. If I leave
> > > everything for around 5 minutes, CTDB unbans itself and then
> > > everything continues running.
> > >
> > > From what I can work out it looks like as the MDS has a stale
> > > session from the powered down node, it won't let the remaining node
> > > access the CTDB lock file (which is also sitting on the CephFS).
> > > CTDB meanwhile is hammering away trying to access the lock file, but
> > > it sees what it thinks is a split brain scenario because something
> > > still has a lock on the lockfile, and so bans itself.
> > >
> > > I'm guessing the solution is to either reduce the mds session
> > > timeout or increase the amount of time/retries for CTDB, but I'm not
> > > sure what's the best approach. Does anyone have any ideas?
> >
> > I believe Ira was looking at this exact issue, and addressed it by
> > lowering the mds_session_timeout to 30 seconds?
> Actually...
> There's a problem with the way I did it: issues in CephFS start to come
> out, like the fact that it doesn't ban clients properly. :(

Could you shed any more light on what these issues might be? I'm assuming they
are around the locking part of ctdb?

> Greg's made comments about this not being production safe, I tend to agree.
> ;)
> But it is possible to make the cluster happy. I've been testing on VMs with
> the following added to my ceph.conf for "a while" now:
> mds_session_timeout = 5
> mds_tick_interval = 1
> mon_tick_interval = 1
> mon_session_timeout = 2
> mds_session_autoclose = 15

These all look like they make Ceph more responsive to the loss of a client.
Given your warning above, what negative effects do you see potentially arising
from them? Or is that more of a warning because they haven't had long-term
testing?
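
For anyone wanting to check whether a node's ceph.conf actually carries the
overrides quoted above, here is a minimal Python sketch. The option names and
values are taken from Ira's message; placing them under [global] is an
assumption (the message does not say which section was used), and the helper
names are hypothetical. Ceph accepts spaces in place of underscores in option
names, so the check normalises for that.

```python
import configparser

# Values quoted in the message above; the section name is an assumption.
OVERRIDES = {
    "mds_session_timeout": 5,
    "mds_tick_interval": 1,
    "mon_tick_interval": 1,
    "mon_session_timeout": 2,
    "mds_session_autoclose": 15,
}


def render_fragment(section="global"):
    """Render the overrides as a ceph.conf-style fragment."""
    lines = [f"[{section}]"]
    lines += [f"{name} = {value}" for name, value in OVERRIDES.items()]
    return "\n".join(lines) + "\n"


def missing_overrides(conf_text, section="global"):
    """Return the override names that are absent or set to a different value."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section(section):
        return sorted(OVERRIDES)
    # Treat "mds session timeout" and "mds_session_timeout" as the same option.
    present = {
        name.replace(" ", "_"): value.strip()
        for name, value in parser.items(section)
    }
    return sorted(
        name for name, value in OVERRIDES.items()
        if present.get(name) != str(value)
    )
```

Calling missing_overrides() on the text of a ceph.conf returns an empty list
when all five settings match. It only inspects the file, not the values the
running daemons are using.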

If the problem is only around the ctdb locking to avoid split brain, I would 
imagine using ctdb in conjunction with pacemaker to handle the fencing would 
also be a workaround?
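
To see how long the lock stays blocked after a hard shutdown, independently of
CTDB, the contention can be probed with a small script. This is a minimal
Python sketch, assuming the lock is an ordinary fcntl byte-range lock on the
file; it is not CTDB's own code, and the function names, retry count and delay
are made up for illustration.

```python
import errno
import fcntl
import os
import time


def try_lock(path):
    """Try once to take an exclusive, non-blocking fcntl lock.

    Returns an open file descriptor holding the lock, or None if another
    holder (for example a stale session) still owns it.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return None
        raise
    return fd


def wait_for_lock(path, attempts=10, delay=1.0):
    """Retry try_lock() and report how many attempts were needed.

    Returns (fd, attempt_number) on success or (None, attempts) on failure.
    """
    for attempt in range(1, attempts + 1):
        fd = try_lock(path)
        if fd is not None:
            return fd, attempt
        if attempt < attempts:
            time.sleep(delay)
    return None, attempts
```

Run against a scratch file on the same CephFS mount (not the live CTDB lock
file) from the surviving node, with a second node holding the lock before it
is powered off. The attempt count times the delay gives a rough measure of how
long the MDS keeps the stale lock. Close the returned descriptor to release it.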

> Since I did this, there have been changes made to CTDB to allow an external
> program to be the arbitrator instead of the fcntl lockfile.  I'm working on an
> etcd integration for that.  Not that it is that complicated, but making sure
> you get the details right is a minor pain.
> Also I'll be giving a talk on all of this at SambaXP on Thursday, so if you
> are there, feel free to catch me in the hall.  (That goes for anyone
> interested in this topic or ceph/samba topics in general!)

I would be really interested in slides/video if any are available after the
event.

> Clearly my being at SambaXP will slow the etcd integration down.  And I'm
> betting Greg, John or Sage will want to talk to me about using mon instead of
> etcd ;).  Call it a "feeling".
> Cheers,
> -Ira
