Hi again,

It seems we have found the problem. First thing this morning we finally saw an error message in the logs:

May 12 09:04:28 gpu01 kernel: [ 3000.040007] ------------[ cut here ]------------
May 12 09:04:28 gpu01 kernel: [ 3000.040011] WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0x121/0x1b8()
May 12 09:04:28 gpu01 kernel: [ 3000.040013] NETDEV WATCHDOG: eth0 (sky2): transmit timed out
May 12 09:04:28 gpu01 kernel: [ 3000.040015] Modules linked in: ipmi_devintf ipmi_watchdog ipmi_poweroff ipmi_msghandler i2c_i801 i2c_core sky2
May 12 09:04:28 gpu01 kernel: [ 3000.040025] Pid: 0, comm: swapper Not tainted 2.6.27.21-atlas-generic-noinitrd #1
May 12 09:04:28 gpu01 kernel: [ 3000.040027]
May 12 09:04:28 gpu01 kernel: [ 3000.040028] Call Trace:
May 12 09:04:28 gpu01 kernel: [ 3000.040030] <IRQ> [<ffffffff80237378>] warn_slowpath+0xb4/0xdc
May 12 09:04:28 gpu01 kernel: [ 3000.040037] [<ffffffff804d2d00>] sk_filter+0x10/0x80
May 12 09:04:28 gpu01 kernel: [ 3000.040040] [<ffffffff804e7b1a>] ip_route_input+0x63e/0xedf
May 12 09:04:28 gpu01 kernel: [ 3000.040044] [<ffffffff803bf7b9>] __next_cpu+0x19/0x26
May 12 09:04:28 gpu01 kernel: [ 3000.040048] [<ffffffff802302e7>] find_busiest_group+0x315/0x7c3
May 12 09:04:28 gpu01 kernel: [ 3000.040051] [<ffffffff80232203>] try_to_wake_up+0x165/0x177
May 12 09:04:28 gpu01 kernel: [ 3000.040054] [<ffffffff8022f0ce>] enqueue_task_fair+0xd8/0x130
May 12 09:04:28 gpu01 kernel: [ 3000.040057] [<ffffffff804df6ed>] dev_watchdog+0x121/0x1b8
May 12 09:04:28 gpu01 kernel: [ 3000.040060] [<ffffffff80232203>] try_to_wake_up+0x165/0x177
May 12 09:04:28 gpu01 kernel: [ 3000.040062] [<ffffffff804df5cc>] dev_watchdog+0x0/0x1b8
May 12 09:04:28 gpu01 kernel: [ 3000.040065] [<ffffffff8023fa06>] run_timer_softirq+0x16e/0x1ee
May 12 09:04:28 gpu01 kernel: [ 3000.040069] [<ffffffff8024c075>] ktime_get_ts+0x21/0x49
May 12 09:04:28 gpu01 kernel: [ 3000.040072] [<ffffffff8023bfad>] __do_softirq+0x6a/0xda
May 12 09:04:28 gpu01 kernel: [ 3000.040075] [<ffffffff8021163c>] call_softirq+0x1c/0x28
May 12 09:04:28 gpu01 kernel: [ 3000.040078] [<ffffffff802130fb>] do_softirq+0x3c/0x81
May 12 09:04:28 gpu01 kernel: [ 3000.040082] [<ffffffff80220326>] smp_apic_timer_interrupt+0x8e/0xa7
May 12 09:04:28 gpu01 kernel: [ 3000.040085] [<ffffffff80210e43>] apic_timer_interrupt+0x83/0x90
May 12 09:04:28 gpu01 kernel: [ 3000.040086] <EOI> [<ffffffff802170e2>] mwait_idle+0x3c/0x46
May 12 09:04:28 gpu01 kernel: [ 3000.040092] [<ffffffff8020ee32>] cpu_idle+0x91/0xd1
May 12 09:04:28 gpu01 kernel: [ 3000.040094]
May 12 09:04:28 gpu01 kernel: [ 3000.040096] ---[ end trace da19323bcd799bc5 ]---
May 12 09:04:28 gpu01 kernel: [ 3000.040098] sky2 eth0: tx timeout
May 12 09:04:28 gpu01 kernel: [ 3000.048993] sky2 eth0: transmit ring 348 .. 308 report=348 done=348
May 12 09:04:28 gpu01 kernel: [ 3000.049017] sky2 eth0: disabling interface
May 12 09:04:28 gpu01 kernel: [ 3000.053439] sky2 eth0: enabling interface
May 12 09:04:31 gpu01 kernel: [ 3003.153938] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx

So it seems there is an autonegotiation problem between the switch and the NIC: the link comes up with flow control rx, but it should be tx:off rx:off. When I put a direct link between the FAI server and this node, the install went fine.

Now I have to find a suitable place to tell the NICs how to behave. Should this go into the initrd and/or the FAI config area? Any suggestions?

Cheers

Carsten
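
P.S. To check quickly which nodes negotiate the wrong setting, something like the following could be run over the collected logs. It is only a minimal sketch: the regular expression is modelled on the single "Link is up" line quoted above, and the names parse_link_up and SAMPLE are made up for illustration. Forcing the setting itself would presumably be done with ethtool's pause option (for example "ethtool -A eth0 rx off tx off"), which I have not yet tested on these nodes.

```python
import re

# Pattern modelled on the sky2 link-up message quoted in the log above, e.g.
# "sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx".
# Other drivers may word this message differently (assumption).
LINK_UP = re.compile(
    r"(?P<iface>\w+): Link is up at (?P<speed>\d+) Mbps, "
    r"(?P<duplex>\w+) duplex, flow control (?P<fc>\w+)"
)


def parse_link_up(line):
    """Return (iface, speed_mbps, duplex, flow_control), or None if the
    line is not a link-up message in the expected format."""
    m = LINK_UP.search(line)
    if m is None:
        return None
    return (m.group("iface"), int(m.group("speed")),
            m.group("duplex"), m.group("fc"))


# The link-up line from the log above, used as a sample input.
SAMPLE = ("May 12 09:04:31 gpu01 kernel: [ 3003.153938] sky2 eth0: "
          "Link is up at 1000 Mbps, full duplex, flow control rx")

if __name__ == "__main__":
    print(parse_link_up(SAMPLE))
```

Any node whose last field is something other than the expected "off" state would then be a candidate for the same transmit timeouts.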