Hi all, I too am testing highly available NFS over TCP, and I'd like to share the findings of a whole day of testing, along with some very interesting conclusions!
Note: I sent these findings to Florian Haas of Linbit (maintainer of the exportfs RA) and he noted that the exportfs RA is meant to be used in active/active setups like Rasca's, not in active/passive setups (which is what I am testing at the moment).

First of all I started with a fresh install of every node and rebooted the NFS client machine. While starting the first test I noticed the failover actually DID work, so I started to investigate further. After a few more failovers I was stuck in a situation where I had a stale mount on the client. These first tests were all done using the migrate command.

I rebooted everything again and started the second batch of tests, now using the node standby command. I noticed that this way of failing over survived far more failovers. Only when I started using the NFS mount during failover (by writing something to it) did the time needed to survive the failover increase considerably.

I dug deeper and started to monitor the NFS TCP connections using netstat, and wrote down the results:

node1 = active node
node2 = passive node

node1 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      ESTABLISHED
  udp        0      0 *:nfs                 *:*

node2 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  udp        0      0 *:nfs                 *:*

I did a failover (migrate resource) from node1 -> node2:

node1 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      ESTABLISHED
  udp        0      0 *:nfs                 *:*

node2 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      ESTABLISHED
  udp        0      0 *:nfs                 *:*

Running the nfs-kernel-server LSB script as a clone keeps TCP sessions ESTABLISHED on the old (now passive) node for about 10 minutes after a failover. After that the state changes to FIN_WAIT1, which lasts about another 4 minutes.
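To make the state changes easier to spot, the relevant columns can be pulled out of the netstat output with a small filter. A sketch of what I mean (the sample data mirrors node1 above; port 2049 is the standard NFS port that netstat shows as ":nfs" when it resolves service names, and on a live node you would pipe real `netstat -tan` output in instead of the sample):

```shell
#!/bin/sh
# Sample "netstat -tan" output as seen on node1 before the failover.
sample='tcp        0      0 0.0.0.0:2049          0.0.0.0:*             LISTEN
tcp        0      0 192.168.0.30:2049     192.168.0.10:767      ESTABLISHED'

# Print the remote endpoint and TCP state of every NFS session,
# skipping the listening socket. On a live node, run:
#   netstat -tan | awk '$4 ~ /:2049$/ && $6 != "LISTEN" {print $5, $6}'
echo "$sample" | awk '$4 ~ /:2049$/ && $6 != "LISTEN" {print $5, $6}'
```

Wrapping the netstat pipeline in `watch -n1` on both nodes makes the ESTABLISHED -> FIN_WAIT1 / TIME_WAIT transitions visible in real time.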
During the time the session is in ESTABLISHED and then FIN_WAIT1 (about 14 minutes in total) it is not possible to migrate the resource back, as this results in a stale mount.

Then I started testing with node standby failovers and saw the following:

node1 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      ESTABLISHED
  udp        0      0 *:nfs                 *:*

node2 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  udp        0      0 *:nfs                 *:*

I did a failover (node standby) from node1 -> node2:

node1 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      TIME_WAIT
  udp        0      0 *:nfs                 *:*

node2 netstat:

  tcp        0      0 *:nfs                 *:*                   LISTEN
  tcp        0      0 192.168.0.30:nfs      192.168.0.10:767      ESTABLISHED
  udp        0      0 *:nfs                 *:*

The session immediately changes to TIME_WAIT, which lasts somewhat shorter (around 2 minutes) than the FIN_WAIT1 state seen with the migrate command. It is still not possible to fail back during the TIME_WAIT state, but after around 2 minutes the session is restored and the mount doesn't become stale.

I concluded that the stopping and starting of nfs-kernel-server (which happens only when doing node standby) is the main difference here. So I started testing with nfs-kernel-server not as a cloned resource but as a normal resource (so it gets stopped/started during failover). After a failover the TCP state on the passive node sometimes remained ESTABLISHED and sometimes became TIME_WAIT. I noticed that if I didn't use the NFS mount during failover the state became TIME_WAIT, and if I did use it the state remained ESTABLISHED. So it had to do with nfs-kernel-server not shutting down all connections on a stop command.

I checked the /etc/init.d/nfs-kernel-server LSB script and saw that the stop command was using signal 2 to stop all nfsd instances. I noticed that while a session is active, the nfsd instances are not stopped. So I changed the signal to 9, and then it killed all nfsd instances on a stop command.
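The difference between the two signals is easy to demonstrate with any process that ignores SIGINT, which is effectively how a busy nfsd behaves towards the init script's stop action. A minimal sketch using a dummy `sleep` process (not the real init script):

```shell
#!/bin/sh
# A stand-in for a busy nfsd: a process that ignores signal 2 (SIGINT).
(trap '' INT; sleep 60) &
pid=$!
sleep 1                       # give the trap time to be installed

kill -2 "$pid"                # what the stock init script sends on "stop"
sleep 1
kill -0 "$pid" 2>/dev/null && echo "survived signal 2"

kill -9 "$pid"                # SIGKILL cannot be caught or ignored
wait "$pid" 2>/dev/null       # reap the child
kill -0 "$pid" 2>/dev/null || echo "killed by signal 9"
```

The real nfsd instances are kernel threads and the init script goes through start-stop-daemon, but the signal semantics are the same: a busy server can simply ignore signal 2, while signal 9 always takes it down.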
Conclusion:

- *Using nfs-kernel-server as a cloned resource prevents quick failovers (< 15 minutes) if you use NFS over TCP*; using it as a normal resource stops and starts the nfsd instances which hold the TCP connections.
- For this to work in active/passive mode the nfs-kernel-server init script needs to be changed: the stop command must use signal 9 to kill all nfsd instances instead of signal 2.

Kind regards,
Caspar Smit

2011/4/14 Alessandro Iurlano <[email protected]>

> Thanks a lot, Rasca.
> Using your configuration I was able to set up the active/active NFS server.
> I had to use the UDP protocol for NFS to work. With TCP, the NFS
> clients would occasionally hang.
> With UDP it seems to work well without any need of rmtab file
> replication/synchronization.
>
> Now I'm trying to go a little further by using the OCFS2 cluster
> filesystem with a dual-primary DRBD configuration. The goal is to be
> able to share the same directory from both nodes while still having
> the failover on a single node.
> With the current configuration, the cluster comes up and every service
> is running as expected.
> But when I unplug the network cable of a node, on the remaining active
> node the exportfs processes hang and I can't see why.
> Any suggestion?
>
> This is my current configuration:
> http://nopaste.voric.com/paste.php?f=sxub6z
>
> Thanks!
> Alessandro
>
> On Mon, Apr 4, 2011 at 11:23 AM, RaSca <[email protected]> wrote:
> > On Sat 02 Apr 2011 19:04:08 CET, Alessandro Iurlano wrote:
> >>
> >> On Fri, Apr 1, 2011 at 11:34 AM, RaSca <[email protected]> wrote:
> >>>>
> >>>> Then I tried to find a way to keep just the rmtab file synchronized on
> >>>> both nodes. I cannot find a way to have pacemaker do this for me. Is
> >>>> there one?
> >>>
> >>> As far as I know, all those operations are handled by the exportfs RA.
> >>
> >> I believe this was true till the backup part was removed. See the git
> >> commit below.
> >
> > So, for some reasons this is not needed anymore, but I don't think
> > this may create problems, surely the RA maintainer has done all the
> > necessary tests.
> >
> >> I checked the boot order and indeed I was doing it the wrong way.
> >> After I fixed it, a couple of tests worked right away, while the
> >> client hanged again when I switched back the cluster to both nodes
> >> online.
> >> Could you post your working configuration?
> >> Thanks,
> >> Alessandro
> >
> > Here it is, note that I'm using DRBD instead of a shared storage
> > (basically each drbd is a stand alone export that can reside
> > independently on a node):
> >
> > node ubuntu-nodo1
> > node ubuntu-nodo2
> > primitive drbd0 ocf:linbit:drbd \
> >         params drbd_resource="r0" \
> >         op monitor interval="20s" timeout="40s"
> > primitive drbd1 ocf:linbit:drbd \
> >         params drbd_resource="r1" \
> >         op monitor interval="20s" timeout="40s"
> > primitive nfs-kernel-server lsb:nfs-kernel-server \
> >         op monitor interval="10s" timeout="30s"
> > primitive ping ocf:pacemaker:ping \
> >         params host_list="172.16.0.1" multiplier="100" name="ping" \
> >         op monitor interval="20s" timeout="60s" \
> >         op start interval="0" timeout="60s"
> > primitive portmap lsb:portmap \
> >         op monitor interval="10s" timeout="30s"
> > primitive share-a-exportfs ocf:heartbeat:exportfs \
> >         params directory="/share-a" clientspec="172.16.0.0/24" options="rw,async,no_subtree_check,no_root_squash" fsid="1" \
> >         op monitor interval="10s" timeout="30s" \
> >         op start interval="0" timeout="40s" \
> >         op stop interval="0" timeout="40s"
> > primitive share-a-fs ocf:heartbeat:Filesystem \
> >         params device="/dev/drbd0" directory="/share-a" fstype="ext3" options="noatime" fast_stop="no" \
> >         op monitor interval="20s" timeout="40s" \
> >         op start interval="0" timeout="60s" \
> >         op stop interval="0" timeout="60s"
> > primitive share-a-ip ocf:heartbeat:IPaddr2 \
> >         params ip="172.16.0.63" nic="eth0" \
> >         op monitor interval="20s" timeout="40s"
> > primitive share-b-exportfs ocf:heartbeat:exportfs \
> >         params directory="/share-b" clientspec="172.16.0.0/24" options="rw,no_root_squash" fsid="2" \
> >         op monitor interval="10s" timeout="30s" \
> >         op start interval="0" timeout="40s" \
> >         op stop interval="0" timeout="40s"
> > primitive share-b-fs ocf:heartbeat:Filesystem \
> >         params device="/dev/drbd1" directory="/share-b" fstype="ext3" options="noatime" fast_stop="no" \
> >         op monitor interval="20s" timeout="40s" \
> >         op start interval="0" timeout="60s" \
> >         op stop interval="0" timeout="60s"
> > primitive share-b-ip ocf:heartbeat:IPaddr2 \
> >         params ip="172.16.0.64" nic="eth0" \
> >         op monitor interval="20s" timeout="40s"
> > primitive statd lsb:statd \
> >         op monitor interval="10s" timeout="30s"
> > group nfs portmap statd nfs-kernel-server
> > group share-a share-a-fs share-a-exportfs share-a-ip
> > group share-b share-b-fs share-b-exportfs share-b-ip
> > ms ms_drbd0 drbd0 \
> >         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> > ms ms_drbd1 drbd1 \
> >         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
> > clone nfs_clone nfs \
> >         meta globally-unique="false"
> > clone ping_clone ping \
> >         meta globally-unique="false"
> > location share-a_on_connected_node share-a \
> >         rule $id="share-a_on_connected_node-rule" -inf: not_defined ping or ping lte 0
> > location share-b_on_connected_node share-b \
> >         rule $id="share-b_on_connected_node-rule" -inf: not_defined ping or ping lte 0
> > colocation share-a_on_ms_drbd0 inf: share-a ms_drbd0:Master
> > colocation share-b_on_ms_drbd1 inf: share-b ms_drbd1:Master
> > order share-a_after_ms_drbd0 inf: ms_drbd0:promote share-a:start
> > order share-b_after_ms_drbd1 inf: ms_drbd1:promote share-b:start
> > property $id="cib-bootstrap-options" \
> >         dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
> >         cluster-infrastructure="openais" \
> >         expected-quorum-votes="2" \
> >         no-quorum-policy="ignore" \
> >         stonith-enabled="false" \
> >         last-lrm-refresh="1301915944"
> >
> > Note that I've grouped all the nfs-server daemons (portmap, nfs-common
> > and nfs-kernel-server) in the cloned group nfs_clone.
> >
> > --
> > RaSca
> > Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
> > [email protected]
> > http://www.miamammausalinux.org
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
