[Kernel-packages] [Bug 1828978] Re: NFSv4.1: Interrupted connections cause high bandwidth RPC ping-pong between client and server

Matthew Ruffell Sun, 17 Nov 2019 22:36:03 -0800

** Tags removed: verification-needed-disco
** Tags added: verification-done-disco


-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1828978

Title:
  NFSv4.1: Interrupted connections cause high bandwidth RPC ping-pong
  between client and server

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Disco:
  Fix Committed

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/1828978

  [Impact]

  There is a bug in NFS v4.1 that causes a large number of RPC calls
  between a client and server when a previous RPC call is interrupted.
  This uses a large amount of bandwidth and can saturate the network.

  The symptoms are as follows:

  * On NFS clients:
  Attempts to access mounted NFS shares associated with the affected
  server block indefinitely.

  * On the network:
  A storm of repeated RPCs between the NFS client and server uses a lot
  of bandwidth. Each RPC is acknowledged by the server with an
  NFS4ERR_SEQ_MISORDERED error.

  * Other NFS clients connected to the same NFS server:
  Performance drops dramatically.

  This occurs during a "false retry", when a client attempts to make a
  new RPC call using a slot+sequence number that references an older,
  cached call. This happens when a user process interrupts an RPC call
  that is in progress.
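
  As a rough illustration of the slot+sequence check involved, the
  sketch below models a single server-side session slot in Python. It
  is a simplified reading of the NFSv4.1 slot rules, not code from any
  real server: the ServerSlot class, its handle() method and the
  request strings are hypothetical, sequence wraparound is ignored,
  and a real server is not required to detect a false retry at all.

```python
from dataclasses import dataclass
from typing import Optional

NFS4_OK = "NFS4_OK"
NFS4ERR_SEQ_MISORDERED = "NFS4ERR_SEQ_MISORDERED"
NFS4ERR_SEQ_FALSE_RETRY = "NFS4ERR_SEQ_FALSE_RETRY"


@dataclass
class ServerSlot:
    """Hypothetical model of one session slot on the server."""
    seqid: int = 0                        # last sequence number executed
    cached_request: Optional[str] = None  # request whose reply is cached

    def handle(self, seqid: int, request: str) -> str:
        if seqid == self.seqid + 1:
            # Next number in order: execute and cache as a new request.
            self.seqid = seqid
            self.cached_request = request
            return NFS4_OK
        if seqid == self.seqid:
            # Same number as the cached call: a genuine retry gets the
            # cached reply; a different request is a false retry.
            if request == self.cached_request:
                return NFS4_OK
            return NFS4ERR_SEQ_FALSE_RETRY
        # Anything else is out of order for this slot.
        return NFS4ERR_SEQ_MISORDERED
```

  In this model, a client that is interrupted after the server has
  executed a call, and that then reuses the same sequence number for a
  different call, hits the false retry branch.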

  [Fix]

  This was fixed in 5.1 upstream with the below commit:

  commit 3453d5708b33efe76f40eca1c0ed60923094b971
  Author: Trond Myklebust <trond.mykleb...@hammerspace.com>
  Date:   Wed Jun 20 17:53:34 2018 -0400
  Subject: NFSv4.1: Avoid false retries when RPC calls are interrupted

  The fix is to pre-emptively increment the sequence number if an RPC
  call is interrupted. To address corner cases, the client interprets
  the NFS4ERR_SEQ_MISORDERED error as a sign that it needs to locate an
  appropriate sequence number between the value it sent and the last
  successfully acked SEQUENCE call.
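
  The behaviour described above can be sketched as a toy Python model.
  This is not the kernel implementation: the ServerSlot and ClientSlot
  classes and their fields are hypothetical names chosen for
  illustration, the server model is reduced to a bare sequence check,
  and session recovery is left out.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_SEQ_MISORDERED = "NFS4ERR_SEQ_MISORDERED"


class ServerSlot:
    """Hypothetical server slot: accepts the next number or a replay."""

    def __init__(self):
        self.seqid = 0  # last sequence number executed

    def handle(self, seqid):
        if seqid == self.seqid + 1:
            self.seqid = seqid
            return NFS4_OK
        if seqid == self.seqid:
            return NFS4_OK  # treated as a replay of the cached call
        return NFS4ERR_SEQ_MISORDERED


class ClientSlot:
    """Hypothetical client slot following the idea of the fix."""

    def __init__(self):
        self.seq_nr = 1      # number to use for the next call
        self.last_acked = 0  # last number the server acknowledged

    def interrupted(self):
        # The call was interrupted and we cannot tell whether the server
        # saw it, so skip its number rather than reuse it.
        self.seq_nr += 1

    def call(self, server):
        attempts = []
        while True:
            status = server.handle(self.seq_nr)
            attempts.append((self.seq_nr, status))
            if status == NFS4_OK:
                self.last_acked = self.seq_nr
                self.seq_nr += 1
                return attempts
            # Misordered: if interrupted calls left a gap, the number we
            # sent may be too high, so step back towards last_acked.
            if self.seq_nr - self.last_acked > 1:
                self.seq_nr -= 1
                continue
            return attempts  # no gap left; a real client would recover


if __name__ == "__main__":
    server, client = ServerSlot(), ClientSlot()
    client.interrupted()        # call 1 never reached the server
    print(client.call(server))  # one misordered attempt, then success
```

  Because the search is bounded by the last acked sequence number, the
  client in this model settles after a small number of attempts rather
  than repeating the same misordered call.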

  Commit 3453d5708b33efe76f40eca1c0ed60923094b971 is a clean cherry-pick
  to disco.

  [Testcase]

  This is difficult to reproduce on test systems, and has instead been
  verified on a production NFS v4.1 system in a customer environment.
  This server is heavily trafficked and has a large number of different
  NFS clients connected to it.

  I have built a test kernel that contains the above patch, and also
  patches for Bug 1842037. It is available here:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf241068-test

  Note that the above kernel is for bionic HWE, and not explicitly
  disco.

  Discussion about the patch validation can be found at the bottom of
  Bug 1842037.

  On unpatched kernels, expect to see the symptoms described in
  [Impact]. On patched kernels, these symptoms should not occur and NFS
  access should work as intended.

  [Regression Potential]

  The changes are localised to NFS v4.1 only, and other versions of NFS
  are not affected. If a regression occurs, users can downgrade NFS
  versions to v4.0 or v3.x until a fix is made.
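
  To see which mounts on a client would be covered by the change, one
  option is to look for NFS v4.1 entries in /proc/mounts. The helper
  below is a minimal sketch, assuming the mount options field reports
  the version as "vers=4.1" (or "minorversion=1" on older kernels);
  the function name nfs41_mounts and the sample text in the usage note
  are made up for illustration.

```python
def nfs41_mounts(mounts_text):
    """Return mount points whose options indicate NFS v4.1.

    mounts_text is text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = fields[3].split(",")
        # Assumption: the version appears as one of these two options.
        if "vers=4.1" in opts or "minorversion=1" in opts:
            hits.append(fields[1])
    return hits


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for mountpoint in nfs41_mounts(f.read()):
            print(mountpoint)
```

  For example, given a hypothetical entry such as
  "server:/export /mnt/data nfs4 rw,vers=4.1,proto=tcp 0 0", the helper
  returns ["/mnt/data"]; entries mounted with vers=4.0 or vers=3 are
  skipped.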

  The changes only take effect when connections are interrupted; under
  typical blue-sky scenarios the new code paths would not be invoked.

  There have been no fixup commits for, or commits near, the requested
  commit in newer kernels, which suggests that this commit fixes the
  issue and has been adopted by the community.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1828978/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
