On 20 Jun 2005, at 10:38, Robert Watson wrote:
On Mon, 20 Jun 2005, Eirik Øverby wrote:
Hmm. Looks like a bug in dummynet. ipfw should not be directly
re-injecting UDP traffic back into the input path from an
outbound path, or it risks re-entering, generating lock order
problems, etc. It should be getting dropped into the netisr queue
to be processed from the netisr context.
This problem would exist across all 5.4 installations, both i386
and amd64? Would it depend on heavy load, or could it
theoretically happen at any time when there's traffic? All three
of my fbsd5 servers (dual opteron, dual p3-1ghz, dual p3-700mhz)
are experiencing random hangs a few weeks apart; my impression is
that they are all stable when running in single-CPU mode.
All using dummynet in a comparable manner. Ideas?
Yes. Basically, the network stack avoids recursion in processing
for "complicated" packets by deferring processing an offending
packet to a thread called the 'netisr'. Whenever the stack reaches
a possible recursion point on a packet, it's supposed to queue the
packet for processing 'later' in a per-protocol queue, unwind, and
then when the netisr runs, pick up and continue processing. In the
stack trace you provide, dummynet appears to immediately
invoke the in-bound network path from the out-bound
network path, walking back into the network stack from the outbound
path. This is generally forbidden, for a variety of reasons:
- We do allow the in-bound path to call the out-bound path, so that
  protocols like TCP, and services like NFS can turn around packets
  without a context switch. If further recursion is permitted, the
  stack may overflow.
- Both paths may hold network stack locks over calls in either
  direction -- specifically, we allow protocol locks to be held over
  calls into the socket layer, as the protocol layer drives operation;
  if a recursive call is made, deadlocks can occur due to violating
  the lock order. This is what is happening in your case.
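
The deferral rule described above can be sketched as a toy model. This is only an illustration in Python, not FreeBSD code: ToyStack, input_path, output_path and run_netisr are hypothetical names, and the hops counter simply stands in for a packet that keeps getting turned around. With defer=False the out-bound path calls straight back into the in-bound path, as dummynet appears to do in the trace; with defer=True it queues the packet and unwinds.

```python
from collections import deque


class ToyStack:
    """Toy model of a stack with an in-bound and an out-bound path.

    All names here are hypothetical; this is not FreeBSD code.
    """

    def __init__(self, defer):
        self.defer = defer      # True: queue re-injected packets, netisr style
        self.queue = deque()    # stands in for a per-protocol netisr queue
        self.depth = 0          # current call nesting
        self.max_depth = 0      # deepest nesting seen so far

    def _enter(self):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def _leave(self):
        self.depth -= 1

    def input_path(self, hops):
        self._enter()
        # The in-bound path is allowed to call the out-bound path directly.
        if hops > 0:
            self.output_path(hops)
        self._leave()

    def output_path(self, hops):
        self._enter()
        if self.defer:
            # Queue the packet and unwind; the netisr picks it up later.
            self.queue.append(hops - 1)
        else:
            # Forbidden variant: walk straight back into the in-bound path.
            self.input_path(hops - 1)
        self._leave()

    def run_netisr(self):
        # Drain the queue from a fresh context, one packet at a time.
        while self.queue:
            self.input_path(self.queue.popleft())


direct = ToyStack(defer=False)
direct.input_path(50)

deferred = ToyStack(defer=True)
deferred.input_path(50)
deferred.run_netisr()

print(direct.max_depth, deferred.max_depth)
```

With 50 turn-arounds the direct variant nests 101 calls, while the deferred variant never goes deeper than 2, because each re-injected packet starts again from a fresh run_netisr call.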
Pretty much all network code is entirely architecture-independent,
so bugs typically span architectures, although race conditions can
sometimes be hard to reproduce if they require precise timing and
multiple processors.
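
A lock order problem of the kind mentioned above can be present on every run, yet only deadlock when two CPUs interleave at the wrong moment. A minimal sketch, again in Python and with hypothetical names (LockOrderChecker, and the lock labels "protocol" and "socket"), records the order in which locks are taken and flags a reversal without needing the unlucky timing. It is similar in spirit to a lock order checker such as FreeBSD's WITNESS, but it is in no way its implementation.

```python
class LockOrderChecker:
    """Toy single-threaded lock order checker (hypothetical, not WITNESS)."""

    def __init__(self):
        self.held = []      # locks currently held, in acquisition order
        self.order = set()  # (earlier, later) pairs observed so far

    def acquire(self, name):
        if name in self.held:
            raise RuntimeError(f"recursive acquire of {name}")
        for h in self.held:
            # If 'name' was ever taken before 'h', taking it after 'h'
            # now is a reversal and could deadlock against another CPU.
            if (name, h) in self.order:
                raise RuntimeError(
                    f"lock order reversal: {name} was taken before {h} earlier"
                )
            self.order.add((h, name))
        self.held.append(name)

    def release(self, name):
        self.held.remove(name)


checker = LockOrderChecker()

# Normal direction: protocol lock held over a call into the socket layer.
checker.acquire("protocol")
checker.acquire("socket")
checker.release("socket")
checker.release("protocol")

# Re-entrant direction: socket lock held, then the protocol layer re-entered.
checker.acquire("socket")
try:
    checker.acquire("protocol")
except RuntimeError as err:
    print(err)
```

The second sequence is reported even though nothing actually blocks in this single-threaded run, which is the point: the ordering bug exists regardless of whether the timing needed to hit it ever occurs.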
So I'm lucky to have seen this one... Great ;)
Is it possible to configure dummynet out of your configuration,
and see if the problem goes away?
I'm running a test right now, will let you know in the morning.
Thanks.
I know enough not to call this a "confirmation", but disabling
dummynet did indeed allow me to finish the backup. I never made it
past 15 GB before; now the full 19 GB tar.gz file is done, and the
boxes are both still running. The funny thing is - I only disabled
dummynet on one of the boxes now - the source of the backup, the box
that pushes data. The other box has pretty much 100% the same setup,
and is also i386. But as traffic shaping can only happen on outgoing
packets, I suppose that makes sense.
I can try re-running the test again if you wish, in order to gain
more statistics. It's just too bad it takes a while ;)
/Eirik
Robert N M Watson
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable