It looks on the face of it that AMD is hanging. Perhaps this is
preventing the system from clearing out buffers and causing lockups
on other mounts. AMD could also be causing a deadlock to occur in the
buffer cache (for the same reason loopback mounts can cause deadlocks).
The next time this happens, if the person rebooting freefall can get
a kernel dump (and has a corresponding debug kernel), I may be able to
track it down for sure. Fixing it is another problem, though. Loopback
deadlocks are a big problem under 3.x.
Essentially, what occurs under 3.x is that the buffer cache runs out of
buffers (or buffer space) during a client op and tries to synchronously
flush unrelated dirty buffers to clear out some room. It may flush a
write of a client-side buffer, which runs an RPC to an nfsd running on the
same machine (i.e. via a loopback mount). That nfsd then turns around and
tries to allocate a new buffer to issue its filesystem write, which may in
turn also run out of buffers or buffer space and attempt to flush yet
another unrelated dirty buffer, which could be another client-side buffer.
But at that point the nfsd is locked up in getnewbuf(), so the result is a
deadlock that locks up the NFS node entirely (and might NOT lock up the
rest of the machine).
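
To make the cycle concrete, here is a minimal sketch of the call chain.
Only getnewbuf() is a real kernel routine; the other names are
illustrative stand-ins for the 3.x code paths, not actual functions:

    /* Illustrative 3.x flush recursion; helper names are hypothetical. */
    struct buf *
    getnewbuf(void)
    {
            while (no_free_buffers()) {
                    /*
                     * 3.x: synchronously flush an unrelated dirty buffer
                     * to make room.  If that buffer belongs to a loopback
                     * NFS client mount, the flush issues a write RPC to
                     * an nfsd on this same machine.
                     */
                    flush_one_dirty_buffer_sync();  /* may block on the RPC */
            }
            return (allocate_buffer());
    }

    void
    nfsd_handle_write(void)
    {
            /*
             * The local nfsd servicing that RPC needs a buffer for its
             * own filesystem write.  If the pool is still exhausted it
             * re-enters the same flush path and may pick another
             * client-side buffer -- but the nfsd that would service that
             * write is the one already blocked above, so neither thread
             * makes progress: a deadlock confined to the NFS mount.
             */
            struct buf *bp = getnewbuf();
            (void)bp;
            /* ... perform the filesystem write ... */
    }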
Under 3.x this is a big problem due to the synchronous flush recursion
in getnewbuf(). Under 4.x it is not as big a problem because flushing
is handled asynchronously by the buf_daemon.
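
For contrast, a hedged sketch of the 4.x shape of the same path (the
wakeup/sleep channels here are illustrative, not a quote of vfs_bio.c):
getnewbuf() hands the flushing off to the buf_daemon kernel thread and
simply waits for space, so the allocating thread never recurses into an
NFS client write itself.

    /* Illustrative only: 4.x defers the flush to a separate daemon thread. */
    while (no_free_buffers()) {
            wakeup(&bd_request);                        /* ask buf_daemon to flush */
            tsleep(&needsbuffer, PRIBIO, "newbuf", 0);  /* sleep until space frees up */
    }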
I've been trying to find a solution to the problem for 3.x. I have a
few ideas. I think we can add a flag to the mount structure that
getnewbuf() would set when synchronously flushing a buffer. The flag
would prevent another getnewbuf() call (say, one made from nfsd) from
trying to flush buffers belonging to the same client mount, breaking the
deadlock cycle. I have to set up a 3.x box and reproduce the deadlock
before I can test the fix, though, and that will take a bit of time.
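
A rough sketch of that idea, assuming a spare bit in the mount
structure's kernel flag word (the MNTK_FLUSHING name and the exact
placement in getnewbuf()'s buffer-scan loop are made up for
illustration, not committed code):

    /* Hypothetical flag; the real name/bit would be chosen to fit 3.x. */
    #define MNTK_FLUSHING   0x01000000      /* getnewbuf() is flushing this mount */

    /* In getnewbuf()'s scan loop, when picking a dirty buffer bp to flush: */
    struct mount *mp = bp->b_vp->v_mount;

    if (mp != NULL && (mp->mnt_kern_flag & MNTK_FLUSHING))
            continue;                       /* skip: already flushing this client mount */
    if (mp != NULL)
            mp->mnt_kern_flag |= MNTK_FLUSHING;
    bwrite(bp);                             /* synchronous flush */
    if (mp != NULL)
            mp->mnt_kern_flag &= ~MNTK_FLUSHING;

With the flag set, an nfsd that re-enters getnewbuf() during the flush
would pass over buffers from that client mount instead of blocking on
them, so the circular wait never forms.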
-Matt
Oct 15 06:18:08 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 15 06:44:49 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 15 16:29:50 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 15 16:37:26 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 15 22:46:08 freefall shutdown: reboot by jdp: Rebooting to unstick NFS
Oct 21 03:10:15 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 21 03:34:24 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 21 04:38:39 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 21 04:46:56 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 21 11:56:01 freefall shutdown: reboot by jdp: Rebooting to clear filesystem related hangs
Oct 22 04:23:41 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 22 04:56:57 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 22 16:40:55 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 22 17:52:34 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 23 00:36:56 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 23 02:45:57 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 23 04:16:57 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 23 04:46:56 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 23 14:44:22 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 23 14:51:53 freefall /kernel: nfs server pid173@freefall:/host: not responding
Oct 23 15:35:55 freefall amd[24839]: /host: mount (amfs_auto_cont): Stale NFS file handle
Oct 23 15:35:55 freefall /kernel: nfs server pid173@freefall:/host: is alive again
Oct 23 15:38:40 freefall amd[25003]: /host: mount (amfs_auto_cont): Stale NFS file handle
Oct 23 15:44:05 freefall shutdown: reboot by unfurl: