nfs stalls client: nfsrv_cache_session: no session

2022-07-16 Thread Peter
Hija,
  I have a problem with NFSv4:

The configuration:
  Server Rel. 13.1-RC2
nfs_server_enable="YES"
nfs_server_flags="-u -t --minthreads 2 --maxthreads 20 -h ..."
mountd_enable="YES"
mountd_flags="-S -p 803 -h ..."
rpc_lockd_enable="YES"
rpc_lockd_flags="-h ..."
rpc_statd_enable="YES"
rpc_statd_flags="-h ..."
rpcbind_enable="YES"
rpcbind_flags="-h ..."
nfsv4_server_enable="YES"
sysctl vfs.nfs.enable_uidtostring=1
sysctl vfs.nfsd.enable_stringtouid=1

  Client bhyve Rel. 13.1-RELEASE on the same system
nfs_client_enable="YES"
nfs_access_cache="600"
nfs_bufpackets="32"
nfscbd_enable="YES"

  Mount-options: nfsv4,readahead=1,rw,async
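For reference, the corresponding client-side fstab entry would be along
these lines (a sketch only; the server address and paths are placeholders,
not the real ones):

```shell
# /etc/fstab on the client (addresses and paths are examples only)
192.168.x.x:/usr/ports  /usr/ports  nfs  nfsv4,readahead=1,rw,async  0  0
```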


Access to the share suddenly stalled. Server reports this in messages,
every second:
   nfsrv_cache_session: no session IPaddr=192.168...

Restarting nfsd and mountd didn't help; now the client also started
to report in messages, every second:
   nfs server 192.168...:/var/sysup/mnt/tmp.6.56160: is alive again

Mounting the same share anew to a different place works fine.

The network babble is this, every second:
   NFS request xid 1678997001 212 getattr fh 0,6/2
   NFS reply xid 1678997001 reply ok 52 getattr ERROR: unk 10052

Forensics: I tried to build openoffice on that share a couple of
   times. So there was a bit of traffic, and some things may have
   overflowed.

There seems to be no way to recover, only crashing the client.




Re: nfs stalls client: nfsrv_cache_session: no session

2022-07-16 Thread Rick Macklem
Peter  wrote:
> Hija,
>  I have a problem with NFSv4:
>
> The configuration:
>   Server Rel. 13.1-RC2
> nfs_server_enable="YES"
> nfs_server_flags="-u -t --minthreads 2 --maxthreads 20 -h ..."
Allowing it to go down to 2 threads is very low. I've never even
tried to run a server with less than 4 threads. Since kernel threads
don't generate much overhead, I'd suggest replacing the
minthreads/maxthreads with "-n 32" for a very small server.
(I didn't write the code that allows number of threads to vary and
 never use that either.)
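In rc.conf terms, the suggested change on the server would look roughly
like this (a sketch; the -h addresses stay elided as in the original
configuration):

```shell
# /etc/rc.conf on the server: a fixed pool of 32 nfsd threads
# instead of the varying minthreads/maxthreads range
nfs_server_flags="-u -t -n 32 -h ..."
```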

> mountd_enable="YES"
> mountd_flags="-S -p 803 -h ..."
> rpc_lockd_enable="YES"
> rpc_lockd_flags="-h ..."
> rpc_statd_enable="YES"
> rpc_statd_flags="-h ..."
> rpcbind_enable="YES"
> rpcbind_flags="-h ..."
> nfsv4_server_enable="YES"
> sysctl vfs.nfs.enable_uidtostring=1
> sysctl vfs.nfsd.enable_stringtouid=1
> 
>   Client bhyve Rel. 13.1-RELEASE on the same system
> nfs_client_enable="YES"
> nfs_access_cache="600"
> nfs_bufpackets="32"
> nfscbd_enable="YES"
> 
>   Mount-options: nfsv4,readahead=1,rw,async
I would expect the behaviour you are seeing for "intr" and/or "soft"
mounts, but since you are not using those, I don't know how you
broke the session. (10052 is NFSERR_BADSESSION.)
You might want to do "nfsstat -m" on the client to see what options
were actually negotiated for the mount and then check that neither
"soft" nor "intr" are there.

I also suspect that the recovery thread in the client (called "nfscl")
is somehow wedged and cannot do the recovery from the bad
session.
A "ps axHl" on the client would be useful to see what the
processes/threads are up to on the client when it is hung.

If increasing the number of nfsd threads in the server doesn't resolve
the problem, I'd guess it is some network weirdness caused by how
the bhyve instance is networked to its host. (I always use bridging
for bhyve instances and do NFS mounts, but I don't work those
mounts hard.)

Btw, "umount -N " on the client will normally get rid
of a hung mount, although it can take a couple of minutes to complete.

rick








Re: nfs stalls client: nfsrv_cache_session: no session

2022-07-16 Thread Rick Macklem
Peter  wrote:
> Hija,
>   I have a problem with NFSv4:
> 
> The configuration:
>   Server Rel. 13.1-RC2
> nfs_server_enable="YES"
> nfs_server_flags="-u -t --minthreads 2 --maxthreads 20 -h ..."
> mountd_enable="YES"
> mountd_flags="-S -p 803 -h ..."
> rpc_lockd_enable="YES"
> rpc_lockd_flags="-h ..."
> rpc_statd_enable="YES"
> rpc_statd_flags="-h ..."
> rpcbind_enable="YES"
> rpcbind_flags="-h ..."
> nfsv4_server_enable="YES"
> sysctl vfs.nfs.enable_uidtostring=1
> sysctl vfs.nfsd.enable_stringtouid=1
> 
>   Client bhyve Rel. 13.1-RELEASE on the same system
> nfs_client_enable="YES"
> nfs_access_cache="600"
> nfs_bufpackets="32"
> nfscbd_enable="YES"
> 
>   Mount-options: nfsv4,readahead=1,rw,async
> 
> 
> Access to the share suddenly stalled. Server reports this in messages,
> every second:
>nfsrv_cache_session: no session IPaddr=192.168...
The attached little patch might help. It will soon be in stable/13, but is not
in releng/13.1.
It fixes the only way I am aware of that the client's "nfscl" thread
can get "stuck" on an old session and not do session recovery.
It might be worth applying it to the client.

This still doesn't explain how the session got broken in the first place.

rick






Attachment: defunct-releng13.1.patch


Re: nfs stalls client: nfsrv_cache_session: no session

2022-07-16 Thread Peter
On Sat, Jul 16, 2022 at 01:43:11PM +, Rick Macklem wrote:
! Peter  wrote:
! > Hija,
! >  I have a problem with NFSv4:
! >
! > The configuration:
! >   Server Rel. 13.1-RC2
! > nfs_server_enable="YES"
! > nfs_server_flags="-u -t --minthreads 2 --maxthreads 20 -h ..."
! Allowing it to go down to 2 threads is very low. I've never even
! tried to run a server with less than 4 threads. Since kernel threads
! don't generate much overhead, I'd suggest replacing the
! minthreads/maxthreads with "-n 32" for a very small server.

Okay.
This is normally used for building ports, quarterly or so, and writes
go to a local filesystem. Only when something doesn't build and I
start manual tests does the default /usr/ports NFS share get the
writes.
With Rel. 13 I think I should move the whole thing to virt-9p
filesystems when I get the chance.
 
! > mountd_enable="YES"
! > mountd_flags="-S -p 803 -h ..."
! > rpc_lockd_enable="YES"
! > rpc_lockd_flags="-h ..."
! > rpc_statd_enable="YES"
! > rpc_statd_flags="-h ..."
! > rpcbind_enable="YES"
! > rpcbind_flags="-h ..."
! > nfsv4_server_enable="YES"
! > sysctl vfs.nfs.enable_uidtostring=1
! > sysctl vfs.nfsd.enable_stringtouid=1
! > 
! >   Client bhyve Rel. 13.1-RELEASE on the same system
! > nfs_client_enable="YES"
! > nfs_access_cache="600"
! > nfs_bufpackets="32"
! > nfscbd_enable="YES"
! > 
! >   Mount-options: nfsv4,readahead=1,rw,async
! I would expect the behaviour you are seeing for "intr" and/or "soft"
! mounts, but since you are not using those, I don't know how you
! broke the session. (10052 is NFSERR_BADSESSION.)
! You might want to do "nfsstat -m" on the client to see what options
! were actually negotiated for the mount and then check that neither
! "soft" nor "intr" are there.

I killed that client after I found no way out. Normally it looks like this:

nfsv4,minorversion=2,tcp,resvport,nconnect=1,hard,cto,sec=sys,acdirmin=3,
acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,
wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,
retrans=2147483647

! I suspect that the recovery thread in the client (called "nfscl") is
! somehow wedged and cannot do the recovery from the bad session,

These were present, two of them. I remember seeing the "D" flag, but
that seems to always be the case.

! If increasing the number of nfsd threads in the server doesn't resolve
! the problem, I'd guess it is some network weirdness caused by how
! the bhyve instance is networked to its host. (I always use bridging
! for bhyve instances and do NFS mounts, but I don't work those
! mounts hard.)

They attach to a netgraph bridge:
https://gitr.daemon.contact/sysup/tree/subr_virt.sh#n84

! Btw, "umount -N " on the client will normally get rid
! of a hung mount, although it can take a couple of minutes to complete.

Oops, I missed that! I only remembered -f, which didn't work.

Thanks!
PMc