On Tue, Jan 31, 2006 at 02:09:05PM +0100, mickey wrote:
> On Tue, Jan 31, 2006 at 12:24:48PM +1100, Nicholas Young wrote:
> > On Mon, Jan 30, 2006 at 04:40:29PM +0100, mickey wrote:
> > > On Mon, Jan 30, 2006 at 05:46:17PM +1100, Nicholas Young wrote:
> > > > Hello
> > > re
> > >
> > > > When booting with the standard kernel everything works fine and I
> > > > can login/use the machine, run stress without any errors.
> > > >
> > > > When booting with the MP kernel it will get to mounting the drive and
> > > > freeze, partial boot log below of where the error occurs.
> > > >
> > > > I have tested this with AMD64/i386 of 3.8 and the AMD64 snapshot
> > > > 24 Jan 2006.
> > >
> > > (i'm not sure what you mean by amd64/i386 ;)
> >
> > I tested it with 3.8 AMD64 and then 3.8 i386.
> >
> > > can you give a try to i386 (not amd64) fresh snap plz?
> >
> > I loaded the i386 snapshot from 24 Jan 06 and it had the same final
> > error.
> >
> > It also gave during the boot:
> > biomask 0 netmask 0 ttymask 0
> > ioapic0: pin 3 shares different IPL interrupts (40..50), degraded
> > performance
> > ioapic0: pin 5 shares different IPL interrupts (40..90), degraded
> > performance
> > pctr: user-level counter enabled
> > apm0: disconnected
> > wd0(pciide2:0:0): timeout
> >         type: ata
> >         c_bcount: 512
> >         c_skip: 0
> > wd0(pciide2:0:0): timeout
> >         type: ata
> >         c_bcount: 512
> >         c_skip: 0
>
> can you try non-smp kernel plz?
> also if that fails try disable pcibios in ukc> .
> same plz try w/ smp kernel if disabling pcibios change anything.
> (to get to ukc do boot -c and then "disable pcibios" at UKC> prompt)
>
Thanks for the reply.

With the non-SMP kernel there is no error: it boots to the login prompt and I can use the machine as normal. This is with pcibios enabled.

With the SMP kernel, disabling pcibios made no obvious difference. It still ends with the same wd0(pciide2:0:0): timeout and then freezes. It looks like it sits in ltsleep+0x6d forever.

At the moment I think a process is requesting a read of the disklabel from the SATA drive; this appears to be what happens when it boots correctly. Either the read fails and times out, or it completes but the calling process is not notified correctly for some reason, and the system then ends up in wdctimeout().

I think this points to a problem with the SATA driver for the NF4 chipset in SMP mode. My reasoning is that, on the same hardware, it works correctly with an IDE drive on both SMP and non-SMP kernels, and with SATA drives on non-SMP kernels.

Does this seem like a reasonable guess at where the problem could be? Is there anything else I should or could try? I can get better output from ddb if someone can tell me what to do. At the moment I am trying to understand how the wdc code works and what changes when reading from a device with an SMP kernel.

I am not sure if this helps, but I ran the following to try to see what is going on when the system ends up in wdctimeout():

boot> boot bsd.mp -d
booting hd0a:bsd.mp: 5000232+873256 [52+258016+239165]=0x613718
entry point at 0x100120 .
[ using 497608 bytes of bsd ELF symbol table ]
Stopped at      Debugger+0x4:   leave
ddb{0}> break wdctimeout
ddb{0}> c
...
fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
biomask 0 netmask 0 ttymask 0
ioapic0: pin 3 shares different IPL interrupts (40..50), degraded performance
ioapic0: pin 5 shares different IPL interrupts (40..90), degraded performance
pctr: user-level cycle counter enabled
apm0: disconnected
Breakpoint at  wdctimeout:     pushl   %ebp
ddb{0}> boot reboot
panic: wdc_exec_command: polled command not done
Stopped at      Debugger+0x4:   leave
RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
ddb{0}> trace
Debugger(d18a55a4,d7e2f060,e9058bfc,d7e2f060,e9058c44) at Debugger+0x4
panic(d04f1340,d7e2f060,0,d05acce0,d1872100) at panic+0x63
wdc_exec_command(d18a50d4,e9058c44,e9058c6c,d016da36) at wdc_exec_command+0x10a
wd_flushcache(d18a7800,10,b0,3e,d1877890) at wd_flushcache+0x67
wd_shutdown(d18a7800,e9058d20,e9058ccc,d01f0d35) at wd_shutdown+0x10
dohooks(d05ae900,1,e9058cfc,d01f0dd1) at dohooks+0x5e
boot(4804,1,e9058d1c,0,0) at boot+0x55
db_boot_poweroff_cmd(d034a2e0,0,ffffffff,e9058d24,d05acee0) at db_boot_poweroff_cmd
db_command(d05acee0,d05acd00,e9058e2c,d01efd61,e9058e08) at db_command+0xff
db_command_loop(1,e9058ec4,e9058e6c,d03465c4,1) at db_command_loop+0x9c
db_trap(1,0,e9058e6c,d0346569,e9058ec4) at db_trap+0x86
kdb_trap(1,0,e9058ec4,d05ae928) at kdb_trap+0xe8
trap() at trap+0xb9
--- trap (number 1) ---
wdctimeout(58,0,10,10,e9057000) at wdctimeout+0x1
Bad frame pointer: 0xe9058f20
ddb{0}> ps
   PID   PPID   PGRP    UID  S       FLAGS  WAIT       COMMAND
     7      0      0      0  2   0x2100604             pfpurge
     6      0      0      0  3   0x2100204  timeout    sensors
     5      0      0      0  3   0x2100204  usbevt     usb1
     4      0      0      0  3   0x2100204  usbtsk     usbtask
     3      0      0      0  3   0x2100204  usbevt     usb0
     2      0      0      0  3   0x2100204  kmalloc    kmthread
     1      0      0      0  3   0x2000004  initexec   swapper
     0     -1      0      0  3   0x2080204  wdccmd     swapper

--
Nich