The tricky part for me is that the client was changing regularly, so I can't
confidently say when the errors started appearing. It's just very suspicious
that it seems to need high load (a new host with higher network bandwidth made
the issue more frequent) and uploading to the server, as pure downloading
doesn't seem to be a problem even when cached data is sent at full bandwidth
for minutes.

Moved the server to 24.04, but I've also moved some I/O-heavy tasks to it, so
there is less need for uploading. The client was on 23.10, and I'm still
holding back on upgrading it for a few more weeks.

I can't say a whole lot about the current situation, as I'm not uploading much
anymore to avoid the issue, but I did run into a hang a few days ago. I didn't
have time to debug it, and the server wouldn't restart gracefully, so I ended
up hard rebooting it.
I believe it was the first occurrence since moving the I/O-heavy tasks; I
wanted to upload a few hundred GiB of data back to the server that had been
downloaded from there a while ago without problems. Otherwise light I/O doesn't
seem to run into this problem: the occasional backup to the server is fine, but
that rarely saturates the network and likely fits completely into the page
cache almost every time.

A few hopefully helpful points for reproducing the problem:
- As mentioned multiple times, downloading alone seems to be unaffected;
uploading is what should be stressed. I suspect that either no simultaneous
downloading is needed at all, or casual filesystem browsing is enough
additional load (see the load-generation sketch after this list).
- A fast client with high bandwidth is key. I ran into this issue a couple of
times with an older host on 1 Gb/s, but a new, faster host on 2.5 Gb/s made
the issue appear significantly more often.
- It likely doesn't matter how the link gets saturated; I either processed
files cached on the server (mixed R/W) or uploaded cached files (reading from
a fast SSD should be fine too), meaning the bottleneck was always the network,
at least while the caches held enough data.
- The files were large, so there were no pauses for metadata fiddling as there
would be with small files, and the page cache was often exhausted. The target
was a single HDD the majority of the time, which often meant that writes
started blocking (a 100-ish MiB/s HDD catching up with close to 250 MiB/s of
incoming data), occasionally making the hosts freeze, as the kernel's
background I/O handling is still bad; we just pretend the issue is gone
because SSDs are fast enough not to run into it. The freezes while the page
cache drains may be good at exposing race conditions.
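
For illustration, a minimal sketch of the kind of upload load I mean, assuming
the export is mounted on the client at /mnt/nfs (the path, file size and file
count are just placeholders, not values from my setup):

    #!/usr/bin/env python3
    # Hypothetical reproducer sketch: stream large files onto an NFS mount so
    # the network stays saturated for a long time.
    import os

    MOUNT = "/mnt/nfs/stress"   # assumed NFS mount point on the client
    FILE_SIZE = 8 * 1024**3     # 8 GiB per file, enough to exhaust the page cache
    CHUNK = 4 * 1024**2         # 4 MiB writes
    NUM_FILES = 50              # a few hundred GiB in total

    os.makedirs(MOUNT, exist_ok=True)
    buf = os.urandom(CHUNK)     # incompressible, so compress-force=zstd still costs CPU

    for i in range(NUM_FILES):
        path = os.path.join(MOUNT, f"upload-{i:03d}.bin")
        with open(path, "wb") as f:
            written = 0
            while written < FILE_SIZE:
                f.write(buf)
                written += CHUNK
        print(f"wrote {path}")

With a slow HDD behind the export, this should also reproduce the situation
where the server's writeback falls behind the incoming data.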

It may be more efficient to start by looking for what's causing the "RPC:
Could not send backchannel reply error: -110" log spam, which might be
related. The lockup may take significant time to catch, while that kernel
message shows up quite frequently.
Even now I get plenty of those lines without experiencing issues and without
uploading much, mostly just downloading large files.
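
For reference, a small sketch of how the frequency of that message could be
watched on the server, assuming a systemd journal is available (the one-minute
window is an arbitrary choice):

    #!/usr/bin/env python3
    # Sketch: follow the kernel log and report roughly how often the
    # backchannel error shows up; the message text is taken from above.
    import subprocess
    import time

    PATTERN = "Could not send backchannel reply error: -110"

    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "-o", "short-iso"],
        stdout=subprocess.PIPE, text=True,
    )

    count, window_start = 0, time.time()
    for line in proc.stdout:
        if PATTERN in line:
            count += 1
        if time.time() - window_start >= 60:
            print(f"{count} backchannel errors in the last minute")
            count, window_start = 0, time.time()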

Some extra info which may or may not matter:
- The server hardware is quite weak, with an old 4-core Broadwell CPU,
possibly helping to expose race conditions
- All file systems are Btrfs with noatime,discard=async,compress-force=zstd,
the latter option surely adding more load
- LUKS is used everywhere, also adding some extra load
- There's a Btrfs (on LUKS) image mounted over NFS (though without a whole lot
of usage)
