Re: System stalls (for 15-30 minutes) during savelog

Carsten Aulbert Tue, 12 May 2009 06:57:59 -0700

Hi again,

It seems we have found the problem. First thing this morning we finally
saw an error message in the logs:
May 12 09:04:28 gpu01 kernel: [ 3000.040007] ------------[ cut here ]------------
May 12 09:04:28 gpu01 kernel: [ 3000.040011] WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0x121/0x1b8()
May 12 09:04:28 gpu01 kernel: [ 3000.040013] NETDEV WATCHDOG: eth0 (sky2): transmit timed out
May 12 09:04:28 gpu01 kernel: [ 3000.040015] Modules linked in: ipmi_devintf ipmi_watchdog ipmi_poweroff ipmi_msghandler i2c_i801 i2c_core sky2
May 12 09:04:28 gpu01 kernel: [ 3000.040025] Pid: 0, comm: swapper Not tainted 2.6.27.21-atlas-generic-noinitrd #1
May 12 09:04:28 gpu01 kernel: [ 3000.040027]
May 12 09:04:28 gpu01 kernel: [ 3000.040028] Call Trace:
May 12 09:04:28 gpu01 kernel: [ 3000.040030]  <IRQ> [<ffffffff80237378>] warn_slowpath+0xb4/0xdc
May 12 09:04:28 gpu01 kernel: [ 3000.040037]  [<ffffffff804d2d00>] sk_filter+0x10/0x80
May 12 09:04:28 gpu01 kernel: [ 3000.040040]  [<ffffffff804e7b1a>] ip_route_input+0x63e/0xedf
May 12 09:04:28 gpu01 kernel: [ 3000.040044]  [<ffffffff803bf7b9>] __next_cpu+0x19/0x26
May 12 09:04:28 gpu01 kernel: [ 3000.040048]  [<ffffffff802302e7>] find_busiest_group+0x315/0x7c3
May 12 09:04:28 gpu01 kernel: [ 3000.040051]  [<ffffffff80232203>] try_to_wake_up+0x165/0x177
May 12 09:04:28 gpu01 kernel: [ 3000.040054]  [<ffffffff8022f0ce>] enqueue_task_fair+0xd8/0x130
May 12 09:04:28 gpu01 kernel: [ 3000.040057]  [<ffffffff804df6ed>] dev_watchdog+0x121/0x1b8
May 12 09:04:28 gpu01 kernel: [ 3000.040060]  [<ffffffff80232203>] try_to_wake_up+0x165/0x177
May 12 09:04:28 gpu01 kernel: [ 3000.040062]  [<ffffffff804df5cc>] dev_watchdog+0x0/0x1b8
May 12 09:04:28 gpu01 kernel: [ 3000.040065]  [<ffffffff8023fa06>] run_timer_softirq+0x16e/0x1ee
May 12 09:04:28 gpu01 kernel: [ 3000.040069]  [<ffffffff8024c075>] ktime_get_ts+0x21/0x49
May 12 09:04:28 gpu01 kernel: [ 3000.040072]  [<ffffffff8023bfad>] __do_softirq+0x6a/0xda
May 12 09:04:28 gpu01 kernel: [ 3000.040075]  [<ffffffff8021163c>] call_softirq+0x1c/0x28
May 12 09:04:28 gpu01 kernel: [ 3000.040078]  [<ffffffff802130fb>] do_softirq+0x3c/0x81
May 12 09:04:28 gpu01 kernel: [ 3000.040082]  [<ffffffff80220326>] smp_apic_timer_interrupt+0x8e/0xa7
May 12 09:04:28 gpu01 kernel: [ 3000.040085]  [<ffffffff80210e43>] apic_timer_interrupt+0x83/0x90
May 12 09:04:28 gpu01 kernel: [ 3000.040086]  <EOI> [<ffffffff802170e2>] mwait_idle+0x3c/0x46
May 12 09:04:28 gpu01 kernel: [ 3000.040092]  [<ffffffff8020ee32>] cpu_idle+0x91/0xd1
May 12 09:04:28 gpu01 kernel: [ 3000.040094]
May 12 09:04:28 gpu01 kernel: [ 3000.040096] ---[ end trace da19323bcd799bc5 ]---
May 12 09:04:28 gpu01 kernel: [ 3000.040098] sky2 eth0: tx timeout
May 12 09:04:28 gpu01 kernel: [ 3000.048993] sky2 eth0: transmit ring 348 .. 308 report=348 done=348
May 12 09:04:28 gpu01 kernel: [ 3000.049017] sky2 eth0: disabling interface
May 12 09:04:28 gpu01 kernel: [ 3000.053439] sky2 eth0: enabling interface
May 12 09:04:31 gpu01 kernel: [ 3003.153938] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx


It seems that there is an autosensing problem between the switch and
the NIC: the link comes up with "flow control rx", but it should be
tx:off rx:off. When I put a direct link between the FAI server and this
node, the install went fine.

Now I have to find a suitable place to tell the NICs how to behave.
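
For reference, a minimal sketch of what the setting itself might look like
with ethtool. It assumes the interface is eth0 (as in the log above), that
ethtool is available in the install environment, and that the sky2 driver
accepts these pause settings; none of this has been verified on these nodes.
The helper name pause_off_cmd is made up for illustration, and it only prints
the command instead of running it.

```shell
#!/bin/sh
# Sketch: build the ethtool call that turns Ethernet flow control
# (pause frames) off on one interface.
# Assumptions: interface defaults to eth0, ethtool is installed,
# and the driver (sky2 here) accepts these pause parameters.

pause_off_cmd() {
    # Print the command that disables pause autonegotiation and
    # both rx and tx pause frames for the given interface.
    echo "ethtool -A ${1:-eth0} autoneg off rx off tx off"
}

# Show what would be run; pipe the output to sh (as root) to apply it.
pause_off_cmd eth0
```

Afterwards "ethtool -a eth0" shows the pause parameters currently in effect,
which should make it possible to check whether the switch and the NIC agree.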

Should this go into the initrd and/or the fai config area?

Any suggestions?

Cheers

Carsten
