On 20 Jun 2005, at 10:38, Robert Watson wrote:


On Mon, 20 Jun 2005, Eirik Øverby wrote:



Hmm. Looks like a bug in dummynet. ipfw should not be directly re-injecting UDP traffic back into the input path from an outbound path, or it risks re-entering, generating lock order problems, etc. It should be getting dropped into the netisr queue to be processed from the netisr context.



Would this problem exist across all 5.4 installations, both i386 and amd64? Does it depend on heavy load, or could it theoretically happen at any time when there's traffic? All three of my FreeBSD 5 servers (dual Opteron, dual P3 1 GHz, dual P3 700 MHz) experience random hangs a few weeks apart; my impression is that they are all stable when running in single-CPU mode. All of them use dummynet in a comparable manner. Ideas?



Yes. Basically, the network stack avoids recursion in processing for "complicated" packets by deferring processing of an offending packet to a thread called the 'netisr'. Whenever the stack reaches a possible recursion point on a packet, it's supposed to queue the packet for processing 'later' in a per-protocol queue, unwind, and then when the netisr runs, pick up and continue processing. In the stack trace you provide, dummynet appears to immediately invoke the in-bound network path from the out-bound network path, walking back into the network stack from the outbound path. This is generally forbidden, for a variety of reasons:

- We do allow the in-bound path to call the out-bound path, so that
  protocols like TCP, and services like NFS can turn around packets
  without a context switch. If further recursion is permitted, the stack
  may overflow.

- Both paths may hold network stack locks over calls in either direction --
  specifically, we allow protocol locks to be held over calls into the
  socket layer, as the protocol layer drives operation; if a recursive
  call is made, deadlocks can occur due to violating the lock order. This
  is what is happening in your case.
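
To make the deferral pattern described above concrete, here is a minimal user-space sketch in Python. It is a toy model, not FreeBSD code: the class ToyStack, its method names, and the packet names are all made up for illustration, and a boolean flag stands in for a non-recursive protocol lock. With defer=True the re-injected packet is queued and picked up only after the stack has unwound; with defer=False the outbound path calls straight back into the inbound path while the lock is still held, which the toy reports as an error where a real kernel would risk a deadlock or stack overflow.

```python
from collections import deque


class ToyStack:
    """Toy model of netisr-style deferral (hypothetical names, not kernel code)."""

    def __init__(self, defer):
        self.defer = defer               # True: queue and unwind; False: re-enter directly
        self.netisr_queue = deque()      # stands in for a per-protocol netisr queue
        self.protocol_lock_held = False  # stands in for a non-recursive protocol lock
        self.log = []                    # records the order in which paths ran

    def ip_input(self, pkt):
        """Inbound path: holds the protocol lock while processing."""
        if self.protocol_lock_held:
            # A real kernel would not raise here; it could deadlock instead.
            raise RuntimeError("inbound path re-entered with protocol lock held")
        self.protocol_lock_held = True
        try:
            self.log.append(("input", pkt))
            if pkt == "request":
                # Allowed direction: inbound calls outbound with the lock held.
                self.ip_output("reply")
        finally:
            self.protocol_lock_held = False

    def ip_output(self, pkt):
        """Outbound path: a shaper here decides to hand the packet back to input."""
        self.log.append(("output", pkt))
        if pkt == "reply":
            self.reinject("looped-" + pkt)

    def reinject(self, pkt):
        if self.defer:
            self.netisr_queue.append(pkt)  # queue it, let the caller unwind
        else:
            self.ip_input(pkt)             # the forbidden outbound -> inbound call

    def run_netisr(self):
        """Drain the queue from a clean context, after the stack has unwound."""
        while self.netisr_queue:
            self.ip_input(self.netisr_queue.popleft())


if __name__ == "__main__":
    good = ToyStack(defer=True)
    good.ip_input("request")
    good.run_netisr()
    print(good.log)

    bad = ToyStack(defer=False)
    try:
        bad.ip_input("request")
    except RuntimeError as err:
        print("direct re-injection failed:", err)
```

In the deferred case the looped packet is processed only once no lock is held, which is the behaviour expected from the netisr queue; the direct case fails exactly at the point where the outbound path walks back into the inbound path.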

Pretty much all network code is entirely architecture-independent, so bugs typically span architectures, although race conditions can sometimes be hard to reproduce if they require precise timing and multiple processors.


So I'm lucky to have seen this one... Great ;)


Is it possible to configure dummynet out of your configuration, and see if the problem goes away?



I'm running a test right now, will let you know in the morning.



Thanks.


I know enough not to call this a "confirmation", but disabling dummynet did indeed allow me to finish the backup. I never made it past 15 GB before; now the full 19 GB tar.gz file is done, and both boxes are still running. The funny thing is that I only disabled dummynet on one of the boxes: the source of the backup, the box that pushes data. The other box has pretty much 100% the same setup, and is also i386. But as traffic shaping can only happen on outgoing packets, I suppose that makes sense.

I can re-run the test if you wish, in order to gather more data. It's just too bad it takes a while ;)


/Eirik



Robert N M Watson



_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
