Hi all, I too am testing highly available NFS over TCP, and I'd like to share the findings of a whole day of testing, along with some very interesting conclusions!
Note: I sent these findings to Florian Haas of Linbit (maintainer of the exportfs RA) and he noted that the exportfs RA is meant to be used in active/active setups like Rasca's, not in active/passive setups (which is what I am testing at the moment).

First of all I started with a fresh install of every node and rebooted the NFS client machine. While starting the first test I noticed the failover actually DID work, so I started to investigate further. After a few more failovers I was stuck in a situation where I had a stale mount on the client. These first tests were all done using the migrate command.

I rebooted everything again and started the second batch of tests, now using the node standby command. I noticed that this way of failing over survived far more failovers. Only when I started using the NFS mount during failover (by writing something to it) did the time needed to survive the failover increase considerably.

I dug deeper and started to monitor the NFS TCP connections using netstat, and wrote down the results:

node1 = active node
node2 = passive node

node1 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      ESTABLISHED
  udp        0      0 *:nfs                 *:*

node2 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  udp        0      0 *:nfs                 *:*

I did a failover (migrate resource) from node1 -> node2:

node1 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      ESTABLISHED
  udp        0      0 *:nfs                 *:*

node2 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      ESTABLISHED
  udp        0      0 *:nfs                 *:*

Running the nfs-kernel-server LSB script as a clone keeps TCP sessions ESTABLISHED on the old (now passive) node for about 10 minutes after a failover. After that the state changes to FIN_WAIT1, which lasts about another 4 minutes.
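To make the state changes easier to spot, the relevant columns can be pulled out of the netstat output with a small filter. A sketch of what I mean (the sample data mirrors node1 above; port 2049 is the standard NFS port that netstat shows as ":nfs" when it resolves service names, and on a live node you would pipe real `netstat -tan` output in instead of the sample):

```shell
#!/bin/sh
# Sample "netstat -tan" output as seen on node1 before the failover.
sample='tcp        0      0 0.0.0.0:2049          0.0.0.0:*             LISTEN
tcp        0      0 192.168.0.30:2049     192.168.0.10:767      ESTABLISHED'

# Print the remote endpoint and TCP state of every NFS session,
# skipping the listening socket. On a live node, run:
#   netstat -tan | awk '$4 ~ /:2049$/ && $6 != "LISTEN" {print $5, $6}'
echo "$sample" | awk '$4 ~ /:2049$/ && $6 != "LISTEN" {print $5, $6}'
```

Wrapping the netstat pipeline in `watch -n1` on both nodes makes the ESTABLISHED -> FIN_WAIT1 / TIME_WAIT transitions visible in real time.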
During the time the session is in ESTABLISHED and then FIN_WAIT1 (about 14 minutes in total) it is not possible to migrate the resource back, as this results in a stale mount.

Then I started testing with node standby failovers and saw the following:

node1 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      ESTABLISHED
  udp        0      0 *:nfs                 *:*

node2 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  udp        0      0 *:nfs                 *:*

I did a failover (node standby) from node1 -> node2:

node1 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      TIME_WAIT
  udp        0      0 *:nfs                 *:*

node2 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      ESTABLISHED
  udp        0      0 *:nfs                 *:*

The session immediately changes to TIME_WAIT, which lasts somewhat shorter (around 2 minutes) than the FIN_WAIT1 state seen with the migrate command. It is still not possible to fail back during the TIME_WAIT state, but after around 2 minutes the session is restored and the mount doesn't become stale.

I concluded that the stopping and starting of nfs-kernel-server (which happens only when doing node standby) is the main difference here. So I started testing with nfs-kernel-server not as a cloned resource but as a normal resource (so it gets stopped/started during failover). After a failover the TCP state on the passive node sometimes remained ESTABLISHED and sometimes became TIME_WAIT. I noticed that if I didn't use the NFS mount during failover the state became TIME_WAIT, and if I did use it the state remained ESTABLISHED. So it had to do with nfs-kernel-server not shutting down all connections on a stop command.

I checked the /etc/init.d/nfs-kernel-server LSB script and saw that the stop command was using signal 2 to stop all nfsd instances. I noticed that while a session is active, the nfsd instances are not stopped. So I changed the signal to 9, and then it killed all nfsd instances on a stop command.
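The difference between the two signals is easy to demonstrate with any process that ignores SIGINT, which is effectively how a busy nfsd behaves towards the init script's stop action. A minimal sketch using a dummy `sleep` process (not the real init script):

```shell
#!/bin/sh
# A stand-in for a busy nfsd: a process that ignores signal 2 (SIGINT).
(trap '' INT; sleep 60) &
pid=$!
sleep 1                       # give the trap time to be installed

kill -2 "$pid"                # what the stock init script sends on "stop"
sleep 1
kill -0 "$pid" 2>/dev/null && echo "survived signal 2"

kill -9 "$pid"                # SIGKILL cannot be caught or ignored
wait "$pid" 2>/dev/null       # reap the child
kill -0 "$pid" 2>/dev/null || echo "killed by signal 9"
```

The real nfsd instances are kernel threads and the init script goes through start-stop-daemon, but the signal semantics are the same: a busy server can simply ignore signal 2, while signal 9 always takes it down.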
Conclusion:

- *Using nfs-kernel-server as a cloned resource prevents quick failovers (< 15 minutes) if you use NFS over TCP*; using it as a normal resource stops and starts the nfsd instances which hold the TCP connections.
- For this to work in active/passive mode the nfs-kernel-server init script needs to be changed: the stop command must use signal 9 to kill all nfsd instances instead of signal 2.

Kind regards,
Caspar Smit

2011/4/14 Alessandro Iurlano <[email protected]>

> Thanks a lot, Rasca.
> Using your configuration I was able to set up the active/active NFS server.
> I had to use the UDP protocol for NFS to work. With TCP, the NFS
> clients would occasionally hang.
> With UDP it seems to work well without any need of rmtab file
> replication/synchronization.
>
> Now I'm trying to go a little further by using the OCFS2 cluster
> filesystem with a dual-primary DRBD configuration. The goal is to be
> able to share the same directory from both nodes while still having
> the failover on a single node.
> With the current configuration, the cluster comes up and every service
> is running as expected.
> But when I unplug the network cable of a node, on the remaining active
> node the exportfs processes hang and I can't see why.
> Any suggestion?
>
> This is my current configuration:
> http://nopaste.voric.com/paste.php?f=sxub6z
>
> Thanks!
> Alessandro
>
> On Mon, Apr 4, 2011 at 11:23 AM, RaSca <[email protected]> wrote:
> > On Sat 02 Apr 2011 19:04:08 CET, Alessandro Iurlano wrote:
> >>
> >> On Fri, Apr 1, 2011 at 11:34 AM, RaSca <[email protected]> wrote:
> >>>>
> >>>> Then I tried to find a way to keep just the rmtab file synchronized on
> >>>> both nodes. I cannot find a way to have pacemaker do this for me. Is
> >>>> there one?
> >>>
> >>> As far as I know, all those operations are handled by the exportfs RA.
> >>
> >> I believe this was true till the backup part was removed. See the git
> >> commit below.
> >
> > So, for some reasons this is not needed anymore, but I don't think
> > this may create problems, surely the RA maintainer has done all the
> > necessary tests.
> >
> >> I checked the boot order and indeed I was doing it the wrong way.
> >> After I fixed it, a couple of tests worked right away, while the
> >> client hanged again when I switched back the cluster to both nodes
> >> online.
> >> Could you post your working configuration?
> >> Thanks,
> >> Alessandro
> >
> > Here it is, note that I'm using DRBD instead of a shared storage
> > (basically each drbd is a stand alone export that can reside
> > independently on a node):
> >
> > node ubuntu-nodo1
> > node ubuntu-nodo2
> > primitive drbd0 ocf:linbit:drbd \
> >         params drbd_resource="r0" \
> >         op monitor interval="20s" timeout="40s"
> > primitive drbd1 ocf:linbit:drbd \
> >         params drbd_resource="r1" \
> >         op monitor interval="20s" timeout="40s"
> > primitive nfs-kernel-server lsb:nfs-kernel-server \
> >         op monitor interval="10s" timeout="30s"
> > primitive ping ocf:pacemaker:ping \
> >         params host_list="172.16.0.1" multiplier="100" name="ping" \
> >         op monitor interval="20s" timeout="60s" \
> >         op start interval="0" timeout="60s"
> > primitive portmap lsb:portmap \
> >         op monitor interval="10s" timeout="30s"
> > primitive share-a-exportfs ocf:heartbeat:exportfs \
> >         params directory="/share-a" clientspec="172.16.0.0/24" options="rw,async,no_subtree_check,no_root_squash" fsid="1" \
> >         op monitor interval="10s" timeout="30s" \
> >         op start interval="0" timeout="40s" \
> >         op stop interval="0" timeout="40s"
> > primitive share-a-fs ocf:heartbeat:Filesystem \
> >         params device="/dev/drbd0" directory="/share-a" fstype="ext3" options="noatime" fast_stop="no" \
> >         op monitor interval="20s" timeout="40s" \
> >         op start interval="0" timeout="60s" \
> >         op stop interval="0" timeout="60s"
> > primitive share-a-ip ocf:heartbeat:IPaddr2 \
> >         params ip="172.16.0.63" nic="eth0" \
> >         op monitor interval="20s" timeout="40s"
> > primitive share-b-exportfs ocf:heartbeat:exportfs \
> >         params directory="/share-b" clientspec="172.16.0.0/24" options="rw,no_root_squash" fsid="2" \
> >         op monitor interval="10s" timeout="30s" \
> >         op start interval="0" timeout="40s" \
> >         op stop interval="0" timeout="40s"
> > primitive share-b-fs ocf:heartbeat:Filesystem \
> >         params device="/dev/drbd1" directory="/share-b" fstype="ext3" options="noatime" fast_stop="no" \
> >         op monitor interval="20s" timeout="40s" \
> >         op start interval="0" timeout="60s" \
> >         op stop interval="0" timeout="60s"
> > primitive share-b-ip ocf:heartbeat:IPaddr2 \
> >         params ip="172.16.0.64" nic="eth0" \
> >         op monitor interval="20s" timeout="40s"
> > primitive statd lsb:statd \
> >         op monitor interval="10s" timeout="30s"
> > group nfs portmap statd nfs-kernel-server
> > group share-a share-a-fs share-a-exportfs share-a-ip
> > group share-b share-b-fs share-b-exportfs share-b-ip
> > ms ms_drbd0 drbd0 \
> >         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> > ms ms_drbd1 drbd1 \
> >         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
> > clone nfs_clone nfs \
> >         meta globally-unique="false"
> > clone ping_clone ping \
> >         meta globally-unique="false"
> > location share-a_on_connected_node share-a \
> >         rule $id="share-a_on_connected_node-rule" -inf: not_defined ping or ping lte 0
> > location share-b_on_connected_node share-b \
> >         rule $id="share-b_on_connected_node-rule" -inf: not_defined ping or ping lte 0
> > colocation share-a_on_ms_drbd0 inf: share-a ms_drbd0:Master
> > colocation share-b_on_ms_drbd1 inf: share-b ms_drbd1:Master
> > order share-a_after_ms_drbd0 inf: ms_drbd0:promote share-a:start
> > order share-b_after_ms_drbd1 inf: ms_drbd1:promote share-b:start
> > property $id="cib-bootstrap-options" \
> >         dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
> >         cluster-infrastructure="openais" \
> >         expected-quorum-votes="2" \
> >         no-quorum-policy="ignore" \
> >         stonith-enabled="false" \
> >         last-lrm-refresh="1301915944"
> >
> > Note that I've grouped all the nfs-server daemons (portmap, nfs-common
> > and nfs-kernel-server) in the cloned group nfs_clone.
> >
> > --
> > RaSca
> > Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
> > [email protected]
> > http://www.miamammausalinux.org
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
