Re: NICs locking up, "*tcp_sc_h"

Robert Watson Fri, 13 Mar 2009 03:10:45 -0700

On Fri, 13 Mar 2009, Nick Withers wrote:

Sorry for the original double-post, by the way, not quite sure how that happened...
I can reproduce this problem relatively easily, by the way (every 3 days, on average). I meant to say this before, too, but it seems to happen a lot more often on the fxp than on rl.
I'm sorry to ask what is probably a very simple question, but is there somewhere I should look to get clues on debugging from a manually generated dump? I tried "panic" after manually invoking the kernel debugger but proved highly inept at getting from the dump the same information "ps" / "where" gave me within the debugger live.

If this is, in fact, a TCP input lock leak of some sort, then most likely some particular property of a host your system talks to, or a network it runs over, triggers this (presumably) unusual edge case -- perhaps a firewall that mucks with TCP in a funny way, etc. Of course, it might be something completely different -- the fact that everything is blocked on *tcp_sc_h and *tcp simply means that something holding TCP locks hasn't released them, and this could happen for a number of reasons.

Once you've acquired a crashdump, you can run crashinfo(8), which will produce a summary of useful debugging information. There are some things that are a bit easier to do in the run-time debugger, such as lock analysis, as the run-time debugger is more up-close and personal with in-kernel data structures; other things are easier in kgdb, which has complete source code and C type access. I find kgdb works pretty well for everything but "show me what locks are held". Many of our system monitoring tools, including ps and portions of netstat, can actually be run on crashdumps to report the state of the system at the time it crashed -- take a look at the -M and -N command line arguments, which respectively allow you to point those tools at the crashdump and at a kernel with debugging symbols (typically kernel.debug or kernel.symbols) matching the kernel that was booted at the time of the crash.
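
As a rough sketch of what the above might look like in practice, the snippet below prints the commands one could try against a saved dump. The paths /var/crash/vmcore.0 and /boot/kernel/kernel.symbols are assumptions for illustration, as is the helper name crash_cmds; substitute whatever dump and symbol-bearing kernel you actually have, and check the man pages, since not every netstat display is expected to work against a dump.

```sh
#!/bin/sh
# Both paths are assumptions; adjust them to match your system.
VMCORE="${VMCORE:-/var/crash/vmcore.0}"          # crashdump to inspect
KERNEL="${KERNEL:-/boot/kernel/kernel.symbols}"  # kernel with debugging symbols

# Hypothetical helper: print the commands instead of running them,
# so they can be reviewed (and edited) before use.
crash_cmds() {
    # Summary of useful debugging information for the dump
    echo "crashinfo -k $KERNEL $VMCORE"
    # Process state at the time of the crash
    echo "ps -axl -M $VMCORE -N $KERNEL"
    # Network state at the time of the crash (only portions of netstat apply)
    echo "netstat -M $VMCORE -N $KERNEL"
    # Source-level debugging with full C type access
    echo "kgdb $KERNEL $VMCORE"
}

crash_cmds
```

The kernel given to -N (and to kgdb) needs to match the kernel that was running when the dump was taken; otherwise the symbols will not line up with the dump.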


Robert N M Watson
Computer Laboratory
University of Cambridge


Ta for your help!



Tracing PID 31 tid 100030 td 0xffffff00012016e0
sched_switch() at sched_switch+0xf1
mi_switch() at mi_switch+0x18f
turnstile_wait() at turnstile_wait+0x1cf
_mtx_lock_sleep() at _mtx_lock_sleep+0x76
syncache_lookup() at syncache_lookup+0x176
syncache_expand() at syncache_expand+0x38
tcp_input() at tcp_input+0xa7d
ip_input() at ip_input+0xa8
ether_demux() at ether_demux+0x1b9
ether_input() at ether_input+0x1bb
fxp_intr() at fxp_intr+0x233
ithread_loop() at ithread_loop+0x17f
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
____

A "where" on a process stuck in "*tcp", in this case "[swi4: clock]",
gave the somewhat similar:
____

sched_switch() at sched_switch+0xf1
mi_switch() at mi_switch+0x18f
turnstile_wait() at turnstile_wait+0x1cf
_rw_rlock() at _rw_rlock+0x8c
ipfw_chk() at ipfw_chk+0x3ab2
ipfw_check_out() at ipfw_check_out+0xb1
pfil_run_hooks() at pfil_run_hooks+0x9c
ip_output() at ip_output+0x367
syncache_respond() at syncache_respond+0x2fd
syncache_timer() at syncache_timer+0x15a
(...)
____

In this particular case, the fxp0 card is in a lagg with rl0, but this
problem can be triggered with either card on its own...

The scheduler is SCHED_ULE.

I'm not too sure how to give more useful information than this, I'm
afraid. It's a custom kernel, too... Do I need to supply information on
what code actually exists at the relevant addresses (I'm not at all
clued in on how to do this... Sorry!)? Should I chuck WITNESS,
INVARIANTS et al. in?

I *think* every time this has been triggered there's been a "python2.5"
process in the "*tcp" state. This machine runs net-p2p/deluge and
generally has at least 100 TCP connections on the go at any given time.

Can anyone give me a clue as to what I might do to track this down?
Appreciate any pointers.
--
Nick Withers
email: n...@nickwithers.com
Web: http://www.nickwithers.com
Mobile: +61 414 397 446


_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
