On Tue, Jul 31, 2018 at 12:33 AM William Lawton <william.law...@irdeto.com> wrote:
>
> Hi.
>
> We have recently set up our first Ceph cluster (4 nodes), but our node
> failure tests have revealed an intermittent problem. When we take down a
> node (i.e. by powering it off), most of the time all clients reconnect to
> the cluster within milliseconds, but occasionally it can take them 30
> seconds or more. All clients are CentOS 7 instances and have the Ceph
> cluster mount point configured in /etc/fstab as follows:

The first thing I'd do is make sure you've got recent client code -- there
are backports in RHEL, but I'm unclear on how much of that (if any) makes it
into CentOS. You may find it simpler to just install a recent 4.x kernel
from ELRepo. Even if you don't want to use that in production, it would be
useful for isolating any CephFS client issues you're encountering.
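For what it's worth, the kernel upgrade itself is quick on CentOS 7 --
roughly something like the below (treat it as a sketch: the exact
elrepo-release package URL may have changed, and the last step assumes the
new kernel ends up as the first GRUB menu entry):

    # check which kernel (and hence which CephFS kernel client) a client runs
    uname -r

    # add the ELRepo repository and install a mainline (kernel-ml) 4.x kernel
    rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    yum install https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
    yum --enablerepo=elrepo-kernel install kernel-ml

    # boot into the new kernel (assumes it is the first GRUB menu entry)
    grub2-set-default 0
    reboot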
John

> 10.18.49.35:6789,10.18.49.204:6789,10.18.49.101:6789,10.18.49.183:6789:/ /mnt/ceph ceph name=admin,secretfile=/etc/ceph_key,noatime,_netdev 0 2
>
> On rare occasions, using the ls command, we can see that a failover has
> left a client's /mnt/ceph directory with the following state:
> "??????????? ? ? ? ? ? ceph". When this occurs, we think that the client
> has failed to reconnect within 45 seconds (the mds_reconnect_timeout
> period), so the client has been evicted. We can reproduce this circumstance
> by reducing the MDS reconnect timeout to 1 second.
>
> We'd like to know why our clients sometimes struggle to reconnect after a
> cluster node failure and how to prevent this, i.e. how can we ensure that
> all clients consistently reconnect to the cluster quickly following a node
> failure.
>
> We are using the default configuration options.
>
> Ceph status:
>
>   cluster:
>     id:     ea2d9095-3deb-4482-bf6c-23229c594da4
>     health: HEALTH_OK
>
>   services:
>     mon: 4 daemons, quorum dub-ceph-01,dub-ceph-03,dub-ceph-04,dub-ceph-02
>     mgr: dub-ceph-02(active), standbys: dub-ceph-04.ott.local, dub-ceph-01, dub-ceph-03
>     mds: cephfs-1/1/1 up {0=dub-ceph-03=up:active}, 3 up:standby
>     osd: 4 osds: 4 up, 4 in
>
>   data:
>     pools:   2 pools, 200 pgs
>     objects: 2.36 k objects, 8.9 GiB
>     usage:   31 GiB used, 1.9 TiB / 2.0 TiB avail
>     pgs:     200 active+clean
>
> Thanks
> William Lawton

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com