On Fri, 2009-03-13 at 09:37 +0000, Robert Watson wrote:
> On Fri, 13 Mar 2009, Nick Withers wrote:
>
> > I recently installed my first amd64 system (currently running RELENG_7 from
> > 2009-03-11) to replace an aged ppc box and have been having dramas with the
> > network locking up.
> >
> > Breaking into the debugger manually and ps-ing shows the network card
> > (e.g., "[irq20: fxp0+]") in state "LL" in "*tcp_sc_h". It seems the
> > process(es) trying to access the card at the time is / are in state "L" in
> > "*tcp".
> >
> > I thought this may have been something-or-other in the fxp driver, so
> > installed an rl card and sadly ran into the issue again.
> >
> > The console appears unresponsive, but I can get into the debugger (and as
> > soon as I have, input I'd sent seems to "go through", e.g., if I hit
> > "Enter" a couple o' times, nothing happens; when I <Ctrl>+<Alt>+<Esc> into
> > the debugger a few login prompts pop up before the debugger output).
> >
> > A "where" on the fxp / rl process (thread?) gives (transcribed from the
> > console):
> > ____
>
> Sounds like a lock leak -- if you're running INVARIANTS, then "show allocks"
> and "show allchains" would be useful. I've had a report of a TCP lock leak
> possibly in tcp_input(), but haven't managed to track it down yet -- this
> could well be it as well.

Righto, I'll recompile the kernel with INVARIANTS (hell, I'll go bananas
and include everything listed in
http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-deadlocks.html
- anything else I might include?).

Sorry for the original double-post, by the way, not quite sure how that
happened...

I can reproduce this problem relatively easily, by the way (every 3 days,
on average). I meant to say this before, too, but it seems to happen a
lot more often on the fxp than on rl.

I'm sorry to ask what is probably a very simple question, but is there
somewhere I should look to get clues on debugging from a manually
generated dump? I tried "panic" after manually invoking the kernel
debugger but proved highly inept at getting from the dump the same
information "ps" / "where" gave me within the debugger live.

Ta for your help!

> Robert N M Watson
> Computer Laboratory
> University of Cambridge
>
> >
> > Tracing PID 31 tid 100030 td 0xffffff00012016e0
> > sched_switch() at sched_switch+0xf1
> > mi_switch() at mi_switch+0x18f
> > turnstile_wait() at turnstile_wait+0x1cf
> > _mtx_lock_sleep() at _mtx_lock_sleep+0x76
> > syncache_lookup() at syncache_lookup+0x176
> > syncache_expand() at syncache_expand+0x38
> > tcp_input() at tcp_input+0xa7d
> > ip_input() at ip_input+0xa8
> > ether_demux() at ether_demux+0x1b9
> > ether_input() at ether_input+0x1bb
> > fxp_intr() at fxp_intr+0x233
> > ithread_loop() at ithread_loop+0x17f
> > fork_exit() at fork_exit+0x11f
> > fork_trampoline() at fork_trampoline+0xe
> > ____
> >
> > A "where" on a process stuck in "*tcp", in this case "[swi4: clock]",
> > gave the somewhat similar:
> > ____
> >
> > sched_switch() at sched_switch+0xf1
> > mi_switch() at mi_switch+0x18f
> > turnstile_wait() at turnstile_wait+0x1cf
> > _rw_rlock() at _rw_rlock+0x8c
> > ipfw_chk() at ipfw_chk+0x3ab2
> > ipfw_check_out() at ipfw_check_out+0xb1
> > pfil_run_hooks() at pfil_run_hooks+0x9c
> > ip_output() at ip_output+0x367
> > syncache_respond() at syncache_respond+0x2fd
> > syncache_timer() at syncache_timer+0x15a
> > (...)
> > ____
> >
> > In this particular case, the fxp0 card is in a lagg with rl0, but this
> > problem can be triggered with either card on their own...
> >
> > The scheduler is SCHED_ULE.
> >
> > I'm not too sure how to give more useful information than this, I'm
> > afraid. It's a custom kernel, too... Do I need to supply information on
> > what code actually exists at the relevant addresses (I'm not at all
> > clued in on how to do this... Sorry!)? Should I chuck WITNESS,
> > INVARIANTS et al. in?
> >
> > I *think* every time this has been triggered there's been a "python2.5"
> > process in the "*tcp" state. This machine runs net-p2p/deluge and
> > generally has at least 100 TCP connections on the go at any given time.
> >
> > Can anyone give me a clue as to what I might do to track this down?
> > Appreciate any pointers.
> > --
> > Nick Withers
> > email: n...@nickwithers.com
> > Web: http://www.nickwithers.com
> > Mobile: +61 414 397 446
> >

--
Nick Withers
email: n...@nickwithers.com
Web: http://www.nickwithers.com
Mobile: +61 414 397 446