Hello,

We are facing this problem too with 3 backend VMs and a NFS VM.

Both servers are Google Cloud Compute Engine instances with plenty of CPU
and RAM. We don't see any correlation with high CPU or RAM usage whenever
this happens on any machine. When this happens, apache2 processes on a
backend VM (NFS client machine) wait in state D for a long time (I was only
able to trace a file, and lasted 90s until file unlocked). In a normal
situation, this doesn't happen and requests flow between backend VMs
totally fine.

But when the problem happens, rebooting the client machine doesn't work, as
soon as the load balancer gives it traffic, apache gets stuck again. We got
the machine out of the load balancer to try to isolate it and see logs, but
again as soon as we added it to the load balancer it got stuck. Only
rebooting all VMs and NFS server VM works.

Because we only face this problem on production machines, we need to
quickly restart it and have no time to gather further information to try to
troubleshoot this issue, so we don't have much idea what is going on.

We have the same software versions across all machines and they get updated
every morning at 6 am. We have been facing this problem for some time.
About the time of posting the original question, we migrated our machines
from a very old Debian version to Bookworm by new installation, after that
we started to see this problem occur.

Kernel: Linux web04 6.1.0-26-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian
6.1.112-1 (2024-09-30) x86_64 GNU/Linux
nfs-kernel-server: 1:2.6.2-4
apache2: 2.4.62-1~deb12u1
php 8.3 from deb.sury.org: 8.3.12-1+0~20240927.43+debian12~1.gbpad3b8c

So I would like to know if you could troubleshoot it. Or if you can give me
steps to test the problem or give some information to file a bug report, I
would happily do it.

Thank you.

Reply via email to