On Wed, Aug 1, 2018 at 12:09 PM William Lawton
<william.law...@irdeto.com> wrote:
>
> Thanks for the advice, John.
>
> Our CentOS 7 clients use Linux kernel v3.10, so I upgraded one of them to
> v4.17 and ran 10 more node failure tests. Unfortunately, the kernel upgrade
> on that client hasn't resolved the issue.
>
> With each test I took down the active MDS node and measured how long the two
> v3.10 clients and the v4.17 client lost the Ceph mount for. There wasn't much
> difference between them: the v3.10 clients lost the mount for between 0 and 21
> seconds, and the v4.17 client for between 0 and 16 seconds. Sometimes the
> clients lost the mount at different times, seconds apart. Other times, two
> clients would lose and recover the mount at exactly the same time and the
> third would lose/recover it some time later. (A sketch of the kind of polling
> loop that can capture these timings is below.)
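>
> A minimal per-client loop along these lines is enough to log when the mount
> disappears and comes back (sketch only -- the mount path, poll interval and
> exact check are illustrative):
>
>   #!/bin/bash
>   # Poll the CephFS mount once a second and log state transitions with a
>   # timestamp, so the outage window can be read off afterwards.
>   state=ok
>   while true; do
>       if stat -t /mnt/ceph >/dev/null 2>&1; then
>           new=ok
>       else
>           new=lost
>       fi
>       if [ "$new" != "$state" ]; then
>           echo "$(date +%s) mount $new"
>           state=$new
>       fi
>       sleep 1
>   done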
>
> We are novices with Ceph, so we're not sure what to expect from it regarding
> resilience: is it normal for clients to lose the mount point for a period of
> time after a failover, and if so, how long should we consider an abnormal
> period?

So with the more recent kernel you're finding the clients do reliably
reconnect, there's just some variation in the time it takes?  Or are
you still losing some clients entirely?

John


>
> William Lawton
>
> -----Original Message-----
> From: John Spray <jsp...@redhat.com>
> Sent: Tuesday, July 31, 2018 11:17 AM
> To: William Lawton <william.law...@irdeto.com>
> Cc: ceph-users@lists.ceph.com; Mark Standley <mark.stand...@irdeto.com>
> Subject: Re: [ceph-users] Intermittent client reconnect delay following node 
> fail
>
> On Tue, Jul 31, 2018 at 12:33 AM William Lawton <william.law...@irdeto.com> 
> wrote:
> >
> > Hi.
> >
> > We have recently set up our first Ceph cluster (4 nodes), but our node
> > failure tests have revealed an intermittent problem. When we take down a
> > node (by powering it off), most of the time all clients reconnect to the
> > cluster within milliseconds, but occasionally it can take them 30 seconds
> > or more. All clients are CentOS 7 instances and have the Ceph cluster mount
> > point configured in /etc/fstab as follows:
>
> The first thing I'd do is make sure you've got recent client code -- there
> are backports in RHEL, but I'm unclear on how much of that (if any) makes it
> into CentOS.  You may find it simpler to just install a recent 4.x kernel
> from ELRepo (roughly as sketched below).  Even if you don't want to use that
> in production, it would be useful for isolating any CephFS client issues
> you're encountering.
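>
> A minimal sketch of that route on CentOS 7 (package and URL names from
> memory -- check elrepo.org for the current release RPM):
>
>   # Add the ELRepo repository and install the mainline kernel
>   rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
>   yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm
>   yum --enablerepo=elrepo-kernel install kernel-ml
>   # Boot into the newly installed kernel (it becomes the first GRUB entry)
>   grub2-set-default 0
>   reboot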
>
> John
>
> >
> > 10.18.49.35:6789,10.18.49.204:6789,10.18.49.101:6789,10.18.49.183:6789:/ /mnt/ceph ceph name=admin,secretfile=/etc/ceph_key,noatime,_netdev 0 2
> >
> > On rare occasions, ls shows that a failover has left a client's /mnt/ceph
> > directory in the following state: "???????????  ? ?    ?       ?            ? ceph".
> > When this occurs, we think the client has failed to reconnect within 45
> > seconds (the mds_reconnect_timeout period) and has therefore been evicted.
> > We can reproduce this by reducing the MDS reconnect timeout to 1 second
> > (roughly as shown below).
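> >
> > For reference, the reconnect window can be inspected and overridden roughly
> > like this (command forms from memory; dub-ceph-03 is the currently active
> > MDS in our cluster):
> >
> >   # On the active MDS host: show the current reconnect window (default 45s)
> >   ceph daemon mds.dub-ceph-03 config get mds_reconnect_timeout
> >
> >   # Temporarily override it on all MDS daemons, e.g. to reproduce the eviction
> >   ceph tell mds.* injectargs '--mds_reconnect_timeout 1'
> >
> >   # List client sessions to see which clients are still attached after failover
> >   ceph daemon mds.dub-ceph-03 session ls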
> >
> > We'd like to know why our clients sometimes struggle to reconnect after a
> > cluster node failure, and how we can ensure that all clients consistently
> > reconnect to the cluster quickly following a node failure.
> >
> > We are using the default configuration options.
> >
> > Ceph Status:
> >
> >   cluster:
> >     id:     ea2d9095-3deb-4482-bf6c-23229c594da4
> >     health: HEALTH_OK
> >
> >   services:
> >     mon: 4 daemons, quorum dub-ceph-01,dub-ceph-03,dub-ceph-04,dub-ceph-02
> >     mgr: dub-ceph-02(active), standbys: dub-ceph-04.ott.local, dub-ceph-01, dub-ceph-03
> >     mds: cephfs-1/1/1 up  {0=dub-ceph-03=up:active}, 3 up:standby
> >     osd: 4 osds: 4 up, 4 in
> >
> >   data:
> >     pools:   2 pools, 200 pgs
> >     objects: 2.36 k objects, 8.9 GiB
> >     usage:   31 GiB used, 1.9 TiB / 2.0 TiB avail
> >     pgs:     200 active+clean
> >
> > Thanks
> >
> > William Lawton
> >
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
