Hi Jason,

> On 17 Mar 2021, at 18:17, Jason Breitman <jbreit...@tildenparkcapital.com> 
> wrote:
> 
> Please review the details below and let me know if there is a setting that I 
> should apply to my FreeBSD NFS Server or if there is a bug fix that I can 
> apply to resolve my issue.
> I shared this information with the linux-nfs mailing list and they believe 
> the issue is on the server side.
> 
> Issue
> NFSv4 mounts periodically hang on the NFS Client.
> 
> During this time, it is possible to manually mount from another NFS Server on 
> the NFS Client having issues.
> Also, other NFS Clients are successfully mounting from the NFS Server in 
> question.
> Rebooting the NFS Client appears to be the only solution.

I have experienced a similarly odd situation with Linux NFS clients getting 
periodically stuck while mounting Isilon NFS servers (Isilon is FreeBSD based, 
but they seem to have their own nfsd).
We had better luck and did manage to get packet captures on both sides 
during the issue. The gist of it is as follows:

- Data flows correctly between SERVER and CLIENT.
- At some point SERVER starts decreasing its TCP Receive Window until it 
reaches 0.
- CLIENT (eager to send data) can only ACK data sent by SERVER.
- When SERVER is done sending data, CLIENT starts sending TCP Window 
Probes, hoping that the TCP Window opens again so it can flush its buffers.
- SERVER responds with a TCP Zero Window to those probes.
- After 6 minutes (the NFS server default idle timeout) SERVER gracefully 
closes the TCP connection by sending a FIN packet (still with a TCP Window of 0).
- CLIENT ACKs that FIN.
- SERVER goes into the FIN_WAIT_2 state.
- CLIENT closes its half of the socket and goes into the LAST_ACK state.
- The FIN is never sent by CLIENT since there is still data in its SendQ and 
the receiver's TCP Window is still 0. At this stage CLIENT starts sending TCP 
Window Probes again and again, hoping that SERVER opens its TCP Window so it 
can flush its buffers and terminate its side of the socket.
- SERVER keeps responding with a TCP Zero Window to those probes.
=> The last two steps go on and on for hours/days, freezing the NFS mount 
bound to that TCP session.

If we had a situation where CLIENT was responsible for closing the TCP Window 
(and initiating the TCP FIN first) while SERVER wanted to send data, we would 
end up in the same state as you, I think.

We never found the root cause of why SERVER decided to close the TCP 
Window and stop accepting data. The fix on the Isilon side was to recycle 
FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 and 
net.inet.tcp.finwait2_timeout=5000). Once the socket has been recycled, SERVER 
answers the next TCP Window Probe from CLIENT with a RST, which triggers the 
teardown of the session on the client side, a new TCP handshake, etc., and 
traffic flows again (NFS starts responding).

To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling 
was implemented on the Isilon side) we added a check script on the client that 
detects LAST_ACK sockets and enforces a TCP RST through an iptables rule, 
something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port 
-j REJECT --reject-with tcp-reset (the script removes this iptables rule as 
soon as the LAST_ACK socket disappears).
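
For illustration only, here is a minimal Python sketch of what such a check 
script could look like. It is not the script we actually ran. It assumes a 
Linux client with "ss -tan" available (which prints the state as LAST-ACK in 
the first column), IPv4 addresses, and root privileges for iptables. The 
function names last_ack_ports, reset_rule and run_once are made up for this 
example.

```python
import subprocess


def last_ack_ports(ss_output, server_addr):
    """Return local ports of LAST-ACK sockets whose peer is server_addr.

    Assumes `ss -tan` style lines:
    State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    """
    ports = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LAST-ACK":
            continue
        local, peer = fields[3], fields[4]
        peer_addr = peer.rpartition(":")[0]
        local_port = local.rpartition(":")[2]
        if peer_addr == server_addr:
            ports.append(int(local_port))
    return ports


def reset_rule(action, server_addr, local_port):
    """Build the iptables command; action is "-A" (add) or "-D" (delete)."""
    return [
        "iptables", action, "OUTPUT", "-p", "tcp",
        "-d", server_addr, "--sport", str(local_port),
        "-j", "REJECT", "--reject-with", "tcp-reset",
    ]


def run_once(server_addr, active_ports):
    """One pass: add rules for new LAST-ACK sockets, drop stale rules.

    active_ports is the set of local ports that currently have a rule.
    Returns the new set of ports with a rule installed.
    """
    out = subprocess.run(
        ["ss", "-tan"], capture_output=True, text=True, check=True
    ).stdout
    stuck = set(last_ack_ports(out, server_addr))
    for port in stuck - active_ports:
        subprocess.run(reset_rule("-A", server_addr, port), check=True)
    for port in active_ports - stuck:
        subprocess.run(reset_rule("-D", server_addr, port), check=True)
    return stuck
```

A wrapper would call run_once() every few seconds, feeding the returned set 
back in as active_ports, so that each rule is removed once its LAST-ACK socket 
is gone.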

The bottom line is to get a packet capture during the outage (client 
and/or server side); it will show you at least the shape of the TCP exchange 
when NFS is stuck.
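
Once you have a capture, a quick way to see which side is advertising a zero 
window is to scan the text output of "tcpdump -n -r <file>". The Python sketch 
below is only an illustration: it assumes the usual tcpdump line format 
("IP src > dst: Flags [...], ..., win N, ..."), ignores RST segments (which 
often carry win 0 anyway), and the function name zero_window_senders is made 
up for this example.

```python
import re
from collections import Counter

# Matches e.g. "IP 10.0.0.10.2049 > 10.0.0.5.876: Flags [.], ack 1, win 0, ..."
LINE_RE = re.compile(r"IP (\S+) > (\S+): Flags \[([^\]]*)\].*?\bwin (\d+)\b")


def zero_window_senders(tcpdump_text):
    """Count segments advertising a zero receive window, per sender."""
    counts = Counter()
    for line in tcpdump_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        sender, _dst, flags, window = match.groups()
        if "R" in flags:
            continue  # RST segments are not window advertisements
        if int(window) == 0:
            counts[sender] += 1
    return dict(counts)
```

If the endpoint with the high count is the server's address and NFS port, you 
are probably looking at the same pattern as described above.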

Youssef

_______________________________________________
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
