Hi.

We have recently set up our first Ceph cluster (4 nodes), but our node failure 
tests have revealed an intermittent problem. When we take down a node (e.g. by 
powering it off), most of the time all clients reconnect to the cluster within 
milliseconds, but occasionally it can take them 30 seconds or more. All clients 
are CentOS 7 instances and have the CephFS mount configured in /etc/fstab as 
follows:

10.18.49.35:6789,10.18.49.204:6789,10.18.49.101:6789,10.18.49.183:6789:/ 
/mnt/ceph ceph name=admin,secretfile=/etc/ceph_key,noatime,_netdev    0       2

On rare occasions, the ls command shows that a failover has left a client's 
/mnt/ceph directory in the following state:
"???????????  ? ?    ?       ?            ? ceph"
When this occurs, we believe the client has failed to reconnect within 45 
seconds (the mds_reconnect_timeout period) and has therefore been evicted by 
the MDS. We can reproduce this state reliably by reducing mds_reconnect_timeout 
to 1 second.
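
For reference, this is roughly how we shrink the timeout when reproducing the 
problem (a sketch of what we do, not a recommendation; mds.dub-ceph-03 is our 
current active MDS):

# Illustration only: reduce the MDS reconnect window from the default 45s
# so the eviction occurs almost immediately after a failover.
ceph tell mds.dub-ceph-03 injectargs '--mds_reconnect_timeout 1'

# Or persistently in ceph.conf on the MDS nodes (followed by an MDS restart):
[mds]
mds reconnect timeout = 1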

We'd like to know why our clients sometimes struggle to reconnect after a 
cluster node failure, and how we can ensure that all clients consistently 
reconnect to the cluster quickly following a node failure.
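
In case it helps, this is roughly how we check whether a client has been 
evicted after a failover (a sketch assuming admin access; the daemon command 
is run on the host of the active MDS):

# List client addresses that the cluster has blacklisted.
ceph osd blacklist ls

# Show the sessions the active MDS currently holds.
ceph daemon mds.dub-ceph-03 session ls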

We are using the default configuration options.
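
For completeness, the effective reconnect timeout can be read back from the 
running MDS to confirm it is still at its default of 45 seconds (a sketch; run 
on the host of the active MDS):

ceph daemon mds.dub-ceph-03 config get mds_reconnect_timeout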

Ceph Status:

  cluster:
    id:     ea2d9095-3deb-4482-bf6c-23229c594da4
    health: HEALTH_OK

  services:
    mon: 4 daemons, quorum dub-ceph-01,dub-ceph-03,dub-ceph-04,dub-ceph-02
    mgr: dub-ceph-02(active), standbys: dub-ceph-04.ott.local, dub-ceph-01, dub-ceph-03
    mds: cephfs-1/1/1 up  {0=dub-ceph-03=up:active}, 3 up:standby
    osd: 4 osds: 4 up, 4 in

  data:
    pools:   2 pools, 200 pgs
    objects: 2.36 k objects, 8.9 GiB
    usage:   31 GiB used, 1.9 TiB / 2.0 TiB avail
    pgs:     200 active+clean

Thanks
William Lawton
