On Wed, Aug 1, 2018 at 12:09 PM William Lawton <william.law...@irdeto.com> wrote:
>
> Thanks for the advice John.
>
> Our CentOS 7 clients use Linux kernel v3.10, so I upgraded one of them to
> v4.17 and ran 10 more node failure tests. Unfortunately, the kernel upgrade
> on the client hasn't resolved the issue.
>
> With each test I took down the active MDS node and monitored how long the
> two v3.10 clients and the v4.17 client lost the ceph mount for. There
> wasn't much difference between them: the v3.10 clients lost the mount for
> between 0 and 21 seconds and the v4.17 client for between 0 and 16 seconds.
> Sometimes each node lost the mount at a different time, i.e. seconds apart.
> Other times, two nodes would lose and recover the mount at exactly the same
> time and the third node would lose/recover some time later.
>
> We are novices with Ceph, so we are not sure what to expect from it
> regarding resilience: is it normal for clients to lose the mount point for
> a period of time after a node failure, and if so, how long a loss should we
> consider abnormal?
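[Editor's note: the 0-21 second loss windows described above can be timed
with a simple polling loop on each client. The sketch below is illustrative
rather than anything posted in the thread: it assumes the /mnt/ceph mount
point from the fstab entry quoted further down, and the one-second poll
interval and five-second stat timeout are arbitrary choices.]

    #!/bin/sh
    # Poll the CephFS mount and report how long it stays unavailable during
    # a failover test. MOUNT and the timings are illustrative assumptions.
    MOUNT=/mnt/ceph
    DOWN_SINCE=""

    while true; do
        # stat on a hung CephFS mount can block, so bound it with a timeout.
        if timeout 5 stat "$MOUNT" >/dev/null 2>&1; then
            if [ -n "$DOWN_SINCE" ]; then
                NOW=$(date +%s)
                echo "$(date): mount recovered after $((NOW - DOWN_SINCE))s"
                DOWN_SINCE=""
            fi
        else
            if [ -z "$DOWN_SINCE" ]; then
                DOWN_SINCE=$(date +%s)
                echo "$(date): mount unavailable"
            fi
        fi
        sleep 1
    done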
So with the more recent kernel you're finding the clients do reliably
reconnect, and there's just some variation in the time it takes? Or are you
still losing some clients entirely?

John

> William Lawton
>
> -----Original Message-----
> From: John Spray <jsp...@redhat.com>
> Sent: Tuesday, July 31, 2018 11:17 AM
> To: William Lawton <william.law...@irdeto.com>
> Cc: ceph-users@lists.ceph.com; Mark Standley <mark.stand...@irdeto.com>
> Subject: Re: [ceph-users] Intermittent client reconnect delay following node fail
>
> On Tue, Jul 31, 2018 at 12:33 AM William Lawton <william.law...@irdeto.com> wrote:
> >
> > Hi.
> >
> > We have recently set up our first ceph cluster (4 nodes), but our node
> > failure tests have revealed an intermittent problem. When we take down a
> > node (i.e. by powering it off), most of the time all clients reconnect to
> > the cluster within milliseconds, but occasionally it can take them 30
> > seconds or more. All clients are CentOS 7 instances and have the ceph
> > cluster mount point configured in /etc/fstab as follows:
>
> The first thing I'd do is make sure you've got recent client code -- there
> are backports in RHEL but I'm unclear on how much of that (if any) makes
> it into CentOS. You may find it simpler to just install a recent 4.x
> kernel from ELRepo. Even if you don't want to use that in production, it
> would be useful to try and isolate any CephFS client issues you're
> encountering.
>
> John
>
> > 10.18.49.35:6789,10.18.49.204:6789,10.18.49.101:6789,10.18.49.183:6789:/ /mnt/ceph ceph name=admin,secretfile=/etc/ceph_key,noatime,_netdev 0 2
> >
> > On rare occasions, using the ls command, we can see that a failover has
> > left a client’s /mnt/ceph directory with the following state:
> > “??????????? ? ? ? ? ? ceph”. When this occurs, we think that the client
> > has failed to reconnect within 45 seconds (the mds_reconnect_timeout
> > period), so the client has been evicted. We can reproduce this by
> > reducing the mds reconnect timeout to 1 second.
> >
> > We'd like to know why our clients sometimes struggle to reconnect after
> > a cluster node failure and how to prevent this, i.e. how can we ensure
> > that all clients consistently reconnect to the cluster quickly following
> > a node failure.
> >
> > We are using the default configuration options.
> >
> > Ceph Status:
> >
> >   cluster:
> >     id:     ea2d9095-3deb-4482-bf6c-23229c594da4
> >     health: HEALTH_OK
> >
> >   services:
> >     mon: 4 daemons, quorum dub-ceph-01,dub-ceph-03,dub-ceph-04,dub-ceph-02
> >     mgr: dub-ceph-02(active), standbys: dub-ceph-04.ott.local, dub-ceph-01, dub-ceph-03
> >     mds: cephfs-1/1/1 up {0=dub-ceph-03=up:active}, 3 up:standby
> >     osd: 4 osds: 4 up, 4 in
> >
> >   data:
> >     pools:   2 pools, 200 pgs
> >     objects: 2.36 k objects, 8.9 GiB
> >     usage:   31 GiB used, 1.9 TiB / 2.0 TiB avail
> >     pgs:     200 active+clean
> >
> > Thanks
> > William Lawton

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
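[Editor's note: as a follow-up to the kernel and eviction points discussed
above, the commands below sketch one way to check the relevant pieces. They
are illustrative, not from the thread: the MDS name dub-ceph-03 is taken
from the status output quoted above, they assume shell access to a client
and to the node running the active MDS, and the ELRepo package name is the
commonly used one but may differ.]

    # On a client: confirm which kernel it is actually running.
    uname -r

    # On a CentOS 7 client: install a recent mainline kernel from ELRepo
    # (assumes the elrepo-release package is already installed; see
    # elrepo.org for the current release RPM).
    yum --enablerepo=elrepo-kernel install kernel-ml

    # On the node running the active MDS (dub-ceph-03 in the status above):
    # show the reconnect window (45 seconds by default, as noted above) and
    # the client sessions the MDS currently knows about.
    ceph daemon mds.dub-ceph-03 config get mds_reconnect_timeout
    ceph daemon mds.dub-ceph-03 session ls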