Hello Michael,

On Wed, Mar 11, 2020 at 1:24 AM Michael Bisig <michael.bi...@switch.ch> wrote:
>
> Hi all,
>
> I am trying to set up an active-active NFS Ganesha cluster (with two Ganeshas 
> (v3.0) running in Docker containers). I managed to get two Ganesha daemons 
> running using the rados_cluster backend for active-active deployment. I have 
> the grace db in the cephfs metadata pool in its own namespace, which keeps 
> track of the node status.
> Now, I can mount the exposed filesystem over NFS (v4.1, v4.2) with both 
> daemons. So far so good.
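
Just to make the setup concrete, a client-side mount of one of the two daemons
presumably looks something like this ("ganesha-a" and /PSEUDO_PATH are
placeholders, not taken from your actual config):

  mount -t nfs -o vers=4.1,proto=tcp ganesha-a:/PSEUDO_PATH /mnt/cephfs
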
>
> Testing high availability resulted in unexpected behavior, and I am not sure 
> whether it is intentional or a configuration problem.
>
> Problem:
> If both daemons are running, no E or N flags are set in the grace db, as I 
> expect. Once one host goes down (or is taken down), NO client can read from 
> or write to the mounted filesystem, not even the clients that are not 
> connected to the dead Ganesha. In the db, I see that the dead Ganesha has 
> state NE and the active one has E. This state is what I expect from the 
> Ganesha documentation. Nevertheless, I would assume that the clients 
> connected to the active daemon are not blocked. This state does not clean 
> itself up (e.g. after the grace period).
> I can unlock this situation by 'lifting' the dead node with a direct db call 
> (using the ganesha-rados-grace tool), but within an active-active deployment 
> this is not suitable.
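
For reference, the "lifting" you describe presumably boils down to calls like
these (pool, namespace and user are taken from the config below; nodeid "b" is
just a stand-in for the dead daemon's actual nodeid):

  # show the current grace epochs and the per-node N/E flags
  ganesha-rados-grace --pool cephfsmetadata --ns grace --userid ganesha dump

  # manually lift the grace period on behalf of the dead node "b"
  ganesha-rados-grace --pool cephfsmetadata --ns grace --userid ganesha lift b
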
>
> The ganesha config looks like:
>
> ------------
> NFS_CORE_PARAM
> {
>         Enable_NLM = false;
>         Protocols = 4;
> }
> NFSv4
> {
>         RecoveryBackend = rados_cluster;
>         Minor_Versions =  1,2;
> }
> RADOS_KV
> {
>     pool = "cephfsmetadata";
>     nodeid = "a" ;
>     namespace = "grace";
>     UserId = "ganesha";
>     Ceph_Conf = "/etc/ceph/ceph.conf";
> }
> MDCACHE {
>         Dir_Chunk = 0;
>         NParts = 1;
>         Cache_Size = 1;
> }
> EXPORT
> {
>         Export_ID=101;
>         Protocols = 4;
>         Transports = TCP;
>         Path = PATH;
>         Pseudo = PSEUDO_PATH;
>         Access_Type = RW;
>         Attr_Expiration_Time = 0;
>         Squash = no_root_squash;
>
>         FSAL {
>                 Name = CEPH;
>                 User_Id = "ganesha";
>                 Secret_Access_Key = CEPHXKEY;
>         }
> }
> LOG {
>         Default_Log_Level = "FULL_DEBUG";
> }
> ------------
>
> Does anyone have similar problems? Or, if this behavior is intentional, can 
> you explain to me why this is the case?
> Thank you in advance for your time and thoughts.

Here's what Jeff Layton had to say (he somehow didn't receive the list posting):

"Yes that is expected. Either the node needs to come back or you have
to take the dead node out of the cluster using ganesha-rados-grace.

[You] mention that doing the latter is "not suitable" for some
reason, but I don't get why. If the node is down and not coming back,
why wouldn't you declare it dead and just remove it?"
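
For completeness, taking a node out of the cluster for good would be something
like the following (again with the pool, namespace and user from your config,
and nodeid "b" standing in for the dead daemon's nodeid):

  # permanently remove the dead node from the grace db so the surviving
  # daemon stops waiting for it to recover
  ganesha-rados-grace --pool cephfsmetadata --ns grace --userid ganesha remove b
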

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D