Hello Michael,

On Wed, Mar 11, 2020 at 1:24 AM Michael Bisig <michael.bi...@switch.ch> wrote:
>
> Hi all,
>
> I am trying to set up an active-active NFS Ganesha cluster (two Ganesha
> daemons (v3.0) running in Docker containers). I managed to get both
> daemons running using the rados_cluster backend for active-active
> deployment. The grace db lives in its own namespace within the CephFS
> metadata pool and keeps track of the node status.
> Now I can mount the exported filesystem over NFS (v4.1, v4.2) through
> both daemons. So far so good.
>
> Testing high availability resulted in unexpected behavior, and I am not
> sure whether it is intentional or a configuration problem.
>
> Problem:
> While both daemons are running, no E or N flags are set in the grace db,
> as I expect. As soon as one host goes down (or is taken down), ALL
> clients can neither read nor write on the mounted filesystem, even the
> clients that are not connected to the dead Ganesha. In the db, I see
> that the dead Ganesha has state NE and the active one has E, which is
> what I expect from the Ganesha documentation. Nevertheless, I would
> expect the clients connected to the active daemon not to be blocked.
> The state is not cleaned up by itself (e.g. after the grace period).
> I can unblock this situation by 'lifting' the dead node with a direct db
> call (using the ganesha-rados-grace tool), but within an active-active
> deployment this is not suitable.
>
> The Ganesha config looks like:
>
> ------------
> NFS_CORE_PARAM
> {
>     Enable_NLM = false;
>     Protocols = 4;
> }
> NFSv4
> {
>     RecoveryBackend = rados_cluster;
>     Minor_Versions = 1,2;
> }
> RADOS_KV
> {
>     pool = "cephfsmetadata";
>     nodeid = "a";
>     namespace = "grace";
>     UserId = "ganesha";
>     Ceph_Conf = "/etc/ceph/ceph.conf";
> }
> MDCACHE {
>     Dir_Chunk = 0;
>     NParts = 1;
>     Cache_Size = 1;
> }
> EXPORT
> {
>     Export_ID = 101;
>     Protocols = 4;
>     Transports = TCP;
>     Path = PATH;
>     Pseudo = PSEUDO_PATH;
>     Access_Type = RW;
>     Attr_Expiration_Time = 0;
>     Squash = no_root_squash;
>
>     FSAL {
>         Name = CEPH;
>         User_Id = "ganesha";
>         Secret_Access_Key = CEPHXKEY;
>     }
> }
> LOG {
>     Default_Log_Level = "FULL_DEBUG";
> }
> ------------
>
> Does anyone have similar problems? Or, if this behavior is intentional,
> can you explain why this is the case?
> Thank you in advance for your time and thoughts.
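For reference, the grace db can be inspected and the dead node cleared with
the ganesha-rados-grace tool. A rough sketch against the pool, namespace and
cephx user from the config above; the failed daemon's nodeid is assumed to be
"b" here, so substitute whatever nodeid that container actually uses, and
double-check the subcommands against the ganesha-rados-grace man page for
your version:

  # show the current/recovery epochs and the per-node E/N flags
  ganesha-rados-grace --pool cephfsmetadata --ns grace --userid ganesha dump

  # clear the dead node's NEED flag so the grace period can be fully
  # lifted and the surviving daemon resumes serving I/O
  ganesha-rados-grace --pool cephfsmetadata --ns grace --userid ganesha lift b

  # or, if the node is gone for good, take it out of the grace db entirely
  ganesha-rados-grace --pool cephfsmetadata --ns grace --userid ganesha remove b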
Here's what Jeff Layton had to say (he didn't get the mailing list posting
somehow):

"Yes, that is expected. Either the node needs to come back or you have to
take the dead node out of the cluster using ganesha-rados-grace. [You]
mention that doing the latter is "not suitable" for some reason, but I
don't get why. If the node is down and not coming back, why wouldn't you
declare it dead and just remove it?"

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D