[Kernel-packages] [Bug 1887607] Re: NFSv4.1: Interrupted connections cause high bandwidth RPC ping-pong between client and server

Matthew Ruffell Tue, 18 Aug 2020 22:31:00 -0700

I installed 4.15.0-114-generic from -proposed to my test client machine,
which is a Ubuntu 18.04 Desktop VM.


I mounted two NFS shares, one with sec=sys, and the other with
sec=krb5p. I then opened each share up in separate tabs in Nautilus.

I then CUT a file from the sec=sys share, and PASTED it into the
sec=krb5p share. The file pasted to the new share correctly, and was
removed from the original share.

I repeated this several times, and each file movement was successful
with no hangs.

I then moved all files from the sec=krb5p share, back to the sec=sys
share, selected all files, then CUT and PASTED them to the sec=krb5p
share.

All files moved without the system hanging. I repeated this a number of
times, and the system functions correctly.

There is an occasional delay when pasting some files, that lasts for no
more than a few seconds, which happens when a file copy is interrupted,
but the NFS subsystem recovers, likely finding the correct slot and
sequence number, and the file moves successfully. The delay is likely
due to the NFS client needing a few round trips to the server to
discover the correct slot and sequence number.

The hang is fixed, and I am happy to mark this as verified.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1887607

Title:
  NFSv4.1: Interrupted connections cause high bandwidth RPC ping-pong
  between client and server

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/1887607

  [Impact]

  There is a bug in NFS v4.1 that causes a large amount of RPC calls
  between a client and server when a previous RPC call is interrupted.
  This uses a large amount of bandwidth and can saturate the network.

  The symptoms are so:

  * On NFS clients:
  Attempts to access mounted NFS shares associated with the affected server 
block indefinitely.

  * On the network:
  A storm of repeated RPCs between NFS client and server uses a lot of 
bandwidth. Each RPC is acknowledged by the server with an 
NFS4ERR_SEQ_MISORDERED error.

  * Other NFS clients connected to the same NFS server:
  Performance drops dramatically.

  This occurs during a "false retry", when a client attempts to make a
  new RPC call using a slot+sequence number that references an older,
  cached call. This happens when a user process interrupts an RPC call
  that is in progress.

  I had previously fixed this for Disco in bug 1828978, and now a
  customer has run into the issue in Bionic. A reproducer is supplied in
  the testcase section, which was something missing from bug 1828978,
  since we never determined how the issue actually occurred back then.

  [Fix]

  This was fixed in 5.1 upstream with the below commit:

  commit 3453d5708b33efe76f40eca1c0ed60923094b971
  Author: Trond Myklebust <trond.mykleb...@hammerspace.com>
  Date: Wed Jun 20 17:53:34 2018 -0400
  Subject: NFSv4.1: Avoid false retries when RPC calls are interrupted

  The fix is to pre-emptively increment the sequence number if an RPC
  call is interrupted, and to address corner cases we interpret the
  NFS4ERR_SEQ_MISORDERED error as a sign we need to locate an
  appropriate sequence number between the value we sent, and the last
  successfully acked SEQUENCE call.

  The commit also requires two fixup commits, which landed in 5.5 and
  5.8-rc6 respectively:

  commit 5c441544f045e679afd6c3c6d9f7aaf5fa5f37b0
  Author: Trond Myklebust <trond.mykleb...@hammerspace.com>
  Date:   Wed Nov 13 08:34:00 2019 +0100
  Subject: NFSv4.x: Handle bad/dead sessions correctly in 
nfs41_sequence_process()

  commit 913fadc5b105c3619d9e8d0fe8899ff1593cc737
  Author: Anna Schumaker <anna.schuma...@netapp.com>
  Date:   Wed Jul 8 10:33:40 2020 -0400
  Subject: NFS: Fix interrupted slots by sending a solo SEQUENCE operation

  Commits 3453d5708b33efe76f40eca1c0ed60923094b971 and
  913fadc5b105c3619d9e8d0fe8899ff1593cc737 require small backports to
  bionic, as struct rpc_cred changed to const struct cred in 5.0, and
  the backports swap them back to struct rpc_cred since that is how 4.15
  works.

  [Testcase]

  You will need four machines. The first, is a kerberos KDC. Set up
  Kerberos correctly and create new service principals for the NFS
  server and for the client. I used: nfs/nfskerb.mydomain.com and
  nfs/client.mydomain.com.

  The second machine will be a NFS server with the krb5p share. Add the nfs 
server kerberos keys to the system's keytab, and set up a NFS server that 
exports a directory with sec=krb5p. Example export:
  /mnt/secretfolder *.mydomain.com(rw,sync,no_subtree_check,sec=krb5p)

  The third machine is a regular NFS server. Export a directory with normal 
sec=sys security. Example export:
  /mnt/sharedfolder *.mydomain.com(rw,sync)

  The fourth is a desktop machine. Add the client kerberos keys to the system's 
keytab. Mount both NFS shares, making sure to use the NFS v4.2 protocol. I used 
the commands:
  mount -t nfs4 nfskerb.mydomain.com:/mnt/secretfolder /mnt/secretfolder_client/
  mount -t nfs4 nfs.mydomain.com:/mnt/sharedfolder /mnt/sharedfolder_client

  Check "mount -l" to ensure that NFS v4.2 is used:
  nfskerb.mydomain.com:/mnt/secretfolder on /mnt/secretfolder_client type nfs4 
(rw,relatime,vers=4.2,<...>,sec=krb5p,<...>)
  nfs.mydomain.com:/mnt/sharedfolder on /mnt/sharedfolder_client type nfs4 
(rw,relatime,vers=4.2,<...>,sec=sys,<...>)

  Generate some files full of random data. I found 20MB from /dev/random
  works great.

  Open each NFS share up in tabs in Nautilus. Copy the random data files
  to the sec=sys NFS share. When they are done, one at a time cut and
  then paste the file into the sec=krb5p NFS share. The bug will trigger
  either on the first, or subsequent tries, but less than 10 tries are
  needed usually.

  There is a test kernel available in the following PPA:
  https://launchpad.net/~mruffell/+archive/ubuntu/sf285439-test

  If you install the test kernel, files will cut and paste correctly,
  and NFS will work as expected.

  [Regression Potential]

  The changes are localised to NFS v4.1 and v4.2 only, and other
  versions of NFS are not affected. If a regression occurs, users can
  downgrade NFS versions to v4.0 or v3.x until a fix is made.

  The changes only impact when connections are interrupted, and under
  typical blue sky scenarios would not be invoked.

  There have been several attempts to fix this in the past, starting
  with f9312a541050 "NFSv4.1: Fix the client behaviour on
  NFS4ERR_SEQ_FALSE_RETRY", and stemming to the commit mentioned in the
  fix section, along with its two other fixup commits. This seems to be
  an ongoing issue where edge cases crop up. I won't be surprised if
  there are further commits down the line.

  [Other Info]

  When I first submitted this fix for SRU, I believed that the fix was:

  commit 02ef04e432babf8fc703104212314e54112ecd2d
  Author: Chuck Lever <chuck.le...@oracle.com>
  Date: Mon Feb 11 11:25:25 2019 -0500
  Subject: NFS: Account for XDR pad of buf->pages

  This is not the case. This was a false positive fix. What it did was
  break NFSv4 GETACL and FS_LOCATIONS requests. When you tried to
  reproduce, the calls were never made since they were broken, and thus
  could not be interrupted, and cutting and pasting files worked fine.

  When you applied the fixup commit
  29e7ca715f2a0b6c0a99b1aec1b0956d1f271955 to fix NFSv4 GETACL and
  FS_LOCATIONS requests, the problem returns, as GETACL and FS_LOCATIONS
  are free to be interrupted and start a high bandwidth ping pong.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1887607/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1887607] Re: NFSv4.1: Interrupted connections cause high bandwidth RPC ping-pong between client and server

Reply via email to