On 4/16/12 1:47 PM, Seth Galitzer wrote:
> Just a quick update.  I set the wait_for_leasetime_on_stop parameter on 
> the exportfs resource to false; now it no longer sleeps for 92 sec and 
> the switchover is instantaneous.  Now I just need to figure out how to 
> disable nfsv4 on the server side and I should be home-free.

As you're testing this, a couple of reminders/observations:

- You're exporting /exports/admin with option rw. If your clients are actually
writing to that directory, and you want to have true failover, you may need
NFSv4. I suggest running a test in which you have a client do an extended write
(with dd, for example) then pull the plug on coronado. Is your file or
filesystem trashed when you do this?
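A hedged sketch of such a test, wrapped in a function so it's easy to rerun; the mount point and the 64 MiB size are assumptions (scale the count up for a longer-running write):

```shell
# Sketch of the suggested failover write test. The mount point passed in
# (e.g. /mnt/admin, a client-side mount of /exports/admin) is hypothetical.
failover_write_test() {
    mnt="$1"
    # Start an extended streaming write in the background.
    dd if=/dev/zero of="$mnt/failover-test" bs=1M count=64 status=none &
    dd_pid=$!
    # ... while dd runs, pull the plug on coronado ...
    wait "$dd_pid" || echo "write interrupted"
    # After failover, inspect the file: is it the size you expect?
    ls -l "$mnt/failover-test"
}
```

On a healthy setup the write completes and the file is intact after the secondary takes over; a truncated file or a hung mount tells you NFSv3 alone isn't giving you the failover semantics you need.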

- If you don't need your clients to be able to write to /exports/admin, then go
ahead and figure out how to turn off NFSv4 (on RHEL6, this is done by passing "-N
4" to nfsd, and is typically done in /etc/sysconfig/nfs). I have the following
exportfs definitions on my primary-primary cluster, and my failover tests work
just fine:
primitive ExportUsrNevis ocf:heartbeat:exportfs \
        description="Site-wide applications installed in /usr/nevis" \
        op start interval="0" timeout="40" \
        op stop interval="0" timeout="120" \
        params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" \
        fsid="20" options="ro,no_root_squash,async" rmtab_backup="none"

Note that I'm exporting this directory ro. If I wanted to support writes with
failover (especially in a primary-primary setup!) I'd have tons more work to do.
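For reference, the RHEL6 knob mentioned above looks like this; the Debian equivalent would live in /etc/default/nfs-kernel-server, and the exact variable name there is something you'd have to check:

```shell
# /etc/sysconfig/nfs (RHEL6): tell rpc.nfsd not to offer NFSv4
RPCNFSDARGS="-N 4"
```

After restarting nfsd, `rpcinfo -u localhost nfs` should no longer list version 4, matching what you already see on cascadia.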

I notice in the configuration you've posted, you haven't included fencing yet.
Don't forget this! And test it as well.
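Just as a rough illustration of what a fencing primitive can look like in the crm shell (the agent, IP address, and credentials here are all made up for the example, and the right agent depends entirely on your hardware):

```
primitive fence-coronado stonith:external/ipmi \
        params hostname="coronado" ipaddr="192.168.1.10" \
               userid="admin" passwd="secret" interface="lan" \
        op monitor interval="60s"
location loc-fence-coronado fence-coronado -inf: coronado
property stonith-enabled="true"
```

You'd want one such primitive per node, each constrained away from the node it fences, and then test it too: kill corosync on one node and make sure the other actually power-cycles it.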

> On 04/16/2012 12:42 PM, Seth Galitzer wrote:
>> I've been poking at this more over the weekend and this morning.  And
>> while your tip about rmtab was useful, it still didn't resolve the
>> problem.  I also made sure that my exports were only being
>> handled/defined by pacemaker and not by /etc/exports.  Though for the
>> cloned nfsserver resource to work, it seems you need an /etc/exports
>> file to exist on the server, even if it's empty.
>>
>> It seems the clue as to what's going on is in this line from the log:
>>
>> coronado exportfs[20325]: INFO: Sleeping 92 seconds to accommodate for
>> NFSv4 lease expiry
>>
>> If I bump up the timeout for the exportfs resource to 95 sec, then after
>> the very long timeout, it switches over correctly.  So while this is a
>> working solution to the problem, a 95 sec timeout is a little long for
>> my personal comfort on a live and active fileserver.  Any idea what is
>> instigating this timeout?  Is it exportfs (looks that way from the log
>> entry), nfsd, or pacemaker?  If pacemaker, then where can I reduce or
>> remove this?
>>
>> I've been looking at disabling nfsv4 entirely on this server, as I don't
>> really need it, but haven't found a solution that works yet.  Tried the
>> suggestion in this thread, but it seems to be for mounts, not nfsd, and
>> still doesn't help:
>> http://lists.debian.org/debian-user/2011/11/msg01585.html
>>
>> Though I have found that v4 is being loaded on one host but not the
>> other.  So if I can find what's different, I may be able to make that work.
>>
>> coronado:~# rpcinfo -u localhost nfs
>> program 100003 version 2 ready and waiting
>> program 100003 version 3 ready and waiting
>> program 100003 version 4 ready and waiting
>>
>> cascadia:~# rpcinfo -u localhost nfs
>> program 100003 version 2 ready and waiting
>> program 100003 version 3 ready and waiting
>>
>> Any further suggestions are welcome.  I'll keep poking until I find a
>> solution.
>>
>> Thanks.
>> Seth
>>
>> On 04/16/2012 11:49 AM, William Seligman wrote:
>>> On 4/14/12 5:55 AM, emmanuel segura wrote:
>>>> Maybe the problem is the primitive nfsserver lsb:nfs-kernel-server; I
>>>> think this primitive was stopped before exportfs-admin
>>>> ocf:heartbeat:exportfs.
>>>>
>>>> And if I remember correctly, lsb:nfs-kernel-server and the exportfs agent
>>>> do the same thing:
>>>>
>>>> the first uses the OS scripts and the second the cluster agents.
>>>
>>> Now that Emmanuel has reminded me, I'll offer two more tips based on advice
>>> he's given me in the past:
>>>
>>> - You can deal with the issue he raises directly by putting additional
>>> constraints in your setup, something like:
>>>
>>> colocation fs-homes-nfsserver inf: group-homes clone-nfsserver
>>> order nfsserver-before-homes inf: clone-nfsserver group-homes
>>>
>>> That will make sure that all the group-homes resources (including
>>> exportfs-admin) will not be run unless an instance of nfsserver is already
>>> running on that node.
>>>
>>> - There's a more fundamental question: Why are you placing the start/stop of
>>> your NFS server on both nodes under pacemaker control? Why not have the NFS
>>> server start at system startup on each node?
>>>
>>> The only reason I see for putting NFS under Pacemaker control is if there
>>> are entries in your /etc/exports file (or the Debian equivalent) that won't
>>> work unless other Pacemaker-controlled resources are running, such as DRBD.
>>> If that's the case, you're better off controlling them with Pacemaker
>>> exportfs resources, the same as you're doing with exportfs-admin, instead
>>> of /etc/exports entries.
>>>
>>>> Il giorno 14 aprile 2012 01:50, William Seligman<
>>>> [email protected]>   ha scritto:
>>>>
>>>>> On 4/13/12 7:18 PM, William Seligman wrote:
>>>>>> On 4/13/12 6:42 PM, Seth Galitzer wrote:
>>>>>>> In attempting to build a nice clean config, I'm now in a state where
>>>>>>> exportfs never starts.  It always times out and errors.
>>>>>>>
>>>>>>> crm config show is pasted here: http://pastebin.com/cKFFL0Xf
>>>>>>> syslog after an attempted restart here: http://pastebin.com/CHdF21M4
>>>>>>>
>>>>>>> Only IPs have been edited.
>>>>>>
>>>>>> It's clear that your exportfs resource is timing out for the admin
>>>>> resource.
>>>>>>
>>>>>> I'm no expert, but here are some "stupid exportfs tricks" to try:
>>>>>>
>>>>>> - Check your /etc/exports file (or whatever the equivalent is in Debian;
>>>>>> "man exportfs" will tell you) on both nodes. Make sure you're not already
>>>>>> exporting the directory when the NFS server starts.
>>>>>>
>>>>>> - Take out the exportfs-admin resource. Then try doing things manually:
>>>>>>
>>>>>> # exportfs x.x.x.0/24:/exports/admin
>>>>>>
>>>>>> Assuming that works, then look at the output of just
>>>>>>
>>>>>> # exportfs
>>>>>>
>>>>>> The clientspec reported by exportfs has to match the clientspec you put
>>>>>> into the resource exactly. If exportfs is canonicalizing or reporting the
>>>>>> clientspec differently, the exportfs monitor won't work. If this is the
>>>>>> case, change the clientspec parameter in exportfs-admin to match.
>>>>>>
>>>>>> If the output of exportfs has any results that span more than one line,
>>>>>> then you've got the problem that the patch I referred you to (quoted
>>>>>> below) is supposed to fix. You'll have to apply the patch to your
>>>>>> exportfs resource.
>>>>>
>>>>> Wait a second; I completely forgot about this thread that I started:
>>>>>
>>>>> <http://www.gossamer-threads.com/lists/linuxha/users/78585>
>>>>>
>>>>> The solution turned out to be to remove the .rmtab files from the
>>>>> directories I was exporting, deleting & touching /var/lib/nfs/rmtab
>>>>> (you'll have to look up the Debian location), and adding
>>>>> rmtab_backup="none" to all my exportfs resources.
>>>>>
>>>>> Hopefully there's a solution for you in there somewhere!
>>>>>
>>>>>>> On 04/13/2012 01:51 PM, William Seligman wrote:
>>>>>>>> On 4/13/12 12:38 PM, Seth Galitzer wrote:
>>>>>>>>> I'm working through this howto doc:
>>>>>>>>> http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf
>>>>>>>>> and am stuck at section 4.4.  When I put the primary node in standby, 
>>>>>>>>> it
>>>>>>>>> seems that NFS never releases the export, so it can't shut down, and
>>>>>>>>> thus can't get started on the secondary node.  Everything up to that
>>>>>>>>> point in the doc works fine and fails over correctly.  But once I add
>>>>>>>>> the exportfs resource, it fails.  I'm running this on debian wheezy 
>>>>>>>>> with
>>>>>>>>> the included standard packages, not custom.
>>>>>>>>>
>>>>>>>>> Any suggestions?  I'd be happy to post configs and logs if requested.
>>>>>>>>
>>>>>>>> Yes, please post the output of "crm configure show", the output of
>>>>>>>> "exportfs" while the resource is running properly, and the relevant
>>>>>>>> sections of your log file. I suggest using pastebin.com, to keep
>>>>>>>> mailboxes filling up with walls of text.
>>>>>>>>
>>>>>>>> In case you haven't seen this thread already, you might want to take a 
>>>>>>>> look:
>>>>>>>>
>>>>>>>> <http://www.gossamer-threads.com/lists/linuxha/dev/77166>
>>>>>>>>
>>>>>>>> And the resulting commit:
>>>>>>>> <https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae>
>>>>>>>>
>>>>>>>> (Links courtesy of Lars Ellenberg.)
>>>>>>>>
>>>>>>>> The problem and patch discussed in those links doesn't quite match
>>>>>>>> what you describe. I mention it because I had to patch my exportfs
>>>>>>>> resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL
>>>>>>>> systems) to get it to work properly in my setup.

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
