On 14.02.2014 22:33, David Vossel wrote:
----- Original Message -----
From: "Dennis Jacobfeuerborn" <denni...@conversis.de>
To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
Sent: Thursday, February 13, 2014 11:18:04 PM
Subject: Re: [Pacemaker] nfs4 cluster fail-over stops working once I introduce
ipaddr2 resource
On 14.02.2014 02:50, Dennis Jacobfeuerborn wrote:
Hi,
I'm still working on my NFSv4 cluster and things are working as
expected...as long as I don't add an IPAddr2 resource.
The DRBD, filesystem and exportfs resources work fine and when I put the
active node into standby everything fails over as expected.
Once I add a VIP as an IPAddr2 resource, however, I start seeing monitor
failures on the p_exportfs_root resource.
I've attached the configuration, status and a log file.
The transition status is the state a moment after I take nfs1
(192.168.100.41) offline. It looks like stopping p_ip_nfs does something
to the p_exportfs_root resource, although I have no idea what that could be.
The final status is the state after the cluster has settled. The
fail-over finished, but the failed action is still present and cannot be
cleared with "crm resource cleanup p_exportfs_root".
The log is the result of a "tail -f" on the corosync.log from the moment
before I issued the "crm node standby nfs1" to when the cluster has
settled.
Does anybody know what the issue could be here? At first I thought that
using a VIP from the same network as the cluster nodes might be the problem,
but when I change it to an IP in a different network (192.168.101.43/24)
the same thing happens.
The moment I remove p_ip_nfs from the configuration again, fail-over back
and forth works without a hitch.
So after a lot of digging I think I have pinpointed the issue: a race between
the monitor and stop actions of the exportfs resource agent.
When "wait_for_leasetime_on_stop" is set the following happens for the
stop action and in this specific order:
1. The directory is unexported
2. Sleep nfs lease time + 2 seconds
The problem seems to be that during this sleep the monitor action is still
invoked, and since the directory has already been unexported it reports a
failure.
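To illustrate, here is a rough shell sketch of what I believe the stop path
does when wait_for_leasetime_on_stop is set (an illustration only, not the
actual ocf:heartbeat:exportfs code; the variable names are made up):

# illustrative sketch only -- not the real ocf:heartbeat:exportfs agent
exportfs_stop_sketch() {
    # 1. unexport the directory
    exportfs -u "${clientspec}:${directory}"
    # 2. wait_for_leasetime_on_stop: sleep NFSv4 lease time + 2 seconds
    leasetime=$(cat /proc/fs/nfsd/nfsv4leasetime)
    sleep $((leasetime + 2))
    # a monitor invoked during this sleep no longer sees the export
    # and therefore reports a failure
}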
Once I add enabled="false" to the monitor operation of the exportfs
resource, the problem disappears.
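Concretely that just means setting enabled="false" on the monitor op in the
crm configuration, e.g. via "crm configure edit p_exportfs_root" (the
interval shown is only an example value):

# change the monitor op of p_exportfs_root to something like:
op monitor interval="30s" enabled="false"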
The question is: how do I ensure that the monitor action is not called
while the stop action is still sleeping?
Would it be a solution to create a lock file for the duration of the
sleep and check for that lock file in the monitoring action?
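Something like this is what I have in mind, purely as a sketch (the lock
file path and the behaviour of the monitor while the lock exists are made
up and untested):

# hypothetical sketch of the lock-file idea, not tested
LOCKFILE="/var/run/exportfs-${OCF_RESOURCE_INSTANCE}.stopping"

exportfs_stop() {
    touch "$LOCKFILE"
    # ... unexport and sleep lease time + 2 seconds ...
    rm -f "$LOCKFILE"
}

exportfs_monitor() {
    # while a stop is in progress, don't report a failure
    [ -e "$LOCKFILE" ] && return $OCF_SUCCESS
    # ... normal monitor logic ...
}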
I'm not 100% sure this analysis is correct, because if the monitor were
really being run while the stop action is still in progress, that would
surprise me; I doubt that is happening.
What happens if you put the IP before the nfs server?

group g_nfs p_ip_nfs p_fs_data p_exportfs_root p_exportfs_data
What was missing was the following:
order o_nfsd_before_nfsgroup inf: cl_lsb_nfsserver g_nfs
(I hope this is actually correct)
Without that, when I issued a standby the nfs-server was shut down
immediately, and that apparently entails a call to "exportfs -au", which
removes all exports and made the monitor actions trip up.
With this explicit ordering the nfsd is only shut down after all the
exportfs resources have been stopped, which fixes the problem I've been
seeing.
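For completeness: ordering constraints are symmetrical by default, so the
stop sequence is simply the reverse of the start sequence. Spelled out:

# start: cl_lsb_nfsserver before g_nfs
# stop (reverse, since symmetrical defaults to true): g_nfs before cl_lsb_nfsserver
order o_nfsd_before_nfsgroup inf: cl_lsb_nfsserver g_nfs symmetrical=true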
If only finding the reason for issues were as easy as fixing them...
Regards,
Dennis
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org