Need clues on possible brokenness.

Chad Schwartz Tue, 10 Apr 2001 11:13:24 -0700
Hiya, list.

Think i've found a rather nasty bug in the kernel, and I need some clues
as to where to look for the issue.

Stats:

Quad Xeon (PIII core) 700mhz machine (1mb cache on each)
4gb RAM
5x36gb SCSI disks - on a DAC1100 RAID controller
3 EEPro 100 cards

The box functions as a database server that runs at about 40% load on
each CPU, and about 1.5gb memory usage.

Kernel: 2.2.18pre11-va2.0smp (Although completely reproducible on stock
2.2.18)

If dmesg output from kernel - or any other info is required, i'd be more
than happy to provide it.


Problem:

Box appears to stop responding to network requests for 30 seconds at a
time.  it appears to be happening when we get this error:

wait_on_bh, CPU 0:
irq:  0 [0 0]
bh:   1 [0 0]
<[c010bb29]> <[c011d07b]> <[c011d1ed]> <[c0116658]> <[c01099fc]>

it LOOKS like the virtaddr's provided are a call trace, however, I can't
be sure - as SOME of the addresses don't show up as a ksym..

the problem LOOKS to be, from my perspective (And light code reading) -
that the function synchronize_bh() is called SOMEWHERE, and then,
wait_on_bh() is called from that.  It also appears that wait_on_bh() loops
through MAXCOUNT times (100000000 times), and fails - therefore exiting
the function.  (it also appears that the 1 global interrupt that is OPEN
is a TIMER interrupt.)

Can SOMEONE give me a clue as to where to start looking for this problem -
and/or perhaps some input from people who work on this code? (The really
confusing thing about this, is that the wait_for_bh() function should NOT
take 30 seconds to jump through the maximum loop count.)

Thanks in advance,

Chad


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Need clues on possible brokenness.

Reply via email to