Folks,

I kicked off a thread just before the holidays about some problems we are having with an Intel SRCU42X RAID controller in a dual-processor production server, originally under 5.3-STABLE and now under 4.10-STABLE. The thread ran out of steam with no resolution to the problem, but I'm hoping that with extra information I might get to the bottom of it.
Basically, after some amount of uptime the kernel will emit an "amr0: Bad slot x completed" message, and pretty soon after this the box goes into a partially unresponsive state, forcing us to reboot it. So far the only thing triggering the problem is the nightly jobs, when the amount of I/O is higher than during the day.

Before deployment, we tested the box with 5.3-STABLE and managed to trigger the problem twice. This forced us to try 4.10-STABLE, which was fine in testing and for a number of weeks after deployment. However, just before New Year we saw our first Bad Slot and crash under 4.10. Since then it has happened 3 more times. We have upgraded the firmware to the latest version available from Intel, and if anything this has made the problem worse.

We're beginning to suspect a dud card, but could do with a few "works fine for us" style posts to build confidence in the support for the card under FreeBSD. The amr driver doesn't explicitly support the card, but it's a rebadged MegaRAID 320 as far as we can tell. Scott Long has posted to say that he is seeing similar problems, but I'm wondering: if it really is a problem with the driver, wouldn't more of you be having problems?

The machine has 3 disks configured as a single RAID5 array. A fourth disk is configured as a hot standby. The card is equipped with 128MB of battery-backed cache. Write-back caching is enabled on the card. Read-ahead caching is enabled in non-adaptive mode.

Is anyone else using an SRCU42X RAID card and seeing similar problems to ours? What about other cards supported by the amr driver?

We could just change the controller, but the problem we are having is pretty random and the feedback gap between change and outcome is long. We'd like to have more information to work with before deciding the next step.

uname -a

FreeBSD xxxxx 4.10-STABLE FreeBSD 4.10-STABLE #7: Tue Nov 16 12:50:42 GMT 2004

dmesg

Copyright (c) 1992-2004 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 4.10-STABLE #7: Tue Nov 16 12:50:42 GMT 2004
    [EMAIL PROTECTED]:/usr/obj/usr/src/sys/POOH
Timecounter "i8254" frequency 1193182 Hz
CPU: Intel(R) Xeon(TM) CPU 3.20GHz (3189.72-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf25  Stepping = 5
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Hyperthreading: 2 logical CPUs
real memory = 4026466304 (3932096K bytes)
Programming 24 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
Programming 24 pins in IOAPIC #1
Programming 24 pins in IOAPIC #2
FreeBSD/SMP: Multiprocessor motherboard: 4 CPUs
 cpu0 (BSP): apic id: 0, version: 0x00050014, at 0xfee00000
 cpu1 (AP): apic id: 1, version: 0x00050014, at 0xfee00000
 cpu2 (AP): apic id: 6, version: 0x00050014, at 0xfee00000
 cpu3 (AP): apic id: 7, version: 0x00050014, at 0xfee00000
 io0 (APIC): apic id: 8, version: 0x00178020, at 0xfec00000
 io1 (APIC): apic id: 9, version: 0x00178020, at 0xfec81000
 io2 (APIC): apic id: 10, version: 0x00178020, at 0xfec81400
Preloaded elf kernel "kernel" at 0xc03cc000.
Preloaded userconfig_script "/boot/kernel.conf" at 0xc03cc09c.
Warning: Pentium 4 CPU: PSE disabled
Pentium Pro MTRR support enabled
md0: Malloc disk
Using $PIR table, 19 entries at 0xc00f3630
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Host to PCI bridge> on motherboard
IOAPIC #0 intpin 16 -> irq 2
IOAPIC #0 intpin 19 -> irq 16
pci0: <PCI bus> on pcib0
pci0: <unknown card> (vendor=0x8086, dev=0x2541) at 0.1
pcib1: <PCI to PCI bridge (vendor=8086 device=2545)> at device 3.0 on pci0
pci2: <PCI bus> on pcib1
pci2: <unknown card> (vendor=0x8086, dev=0x1461) at 28.0
pcib2: <PCI to PCI bridge (vendor=8086 device=1460)> at device 29.0 on pci2
IOAPIC #2 intpin 2 -> irq 18
IOAPIC #2 intpin 1 -> irq 19
pci5: <PCI bus> on pcib2
ahd0: <Adaptec AIC7902 Ultra320 SCSI adapter> port 0x4000-0x40ff,0x3800-0x38ff mem 0xfe9e0000-0xfe9e1fff irq 18 at device 7.0 on pci5
aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
ahd1: <Adaptec AIC7902 Ultra320 SCSI adapter> port 0x3400-0x34ff,0x3000-0x30ff mem 0xfe9f0000-0xfe9f1fff irq 19 at device 7.1 on pci5
aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
pci2: <unknown card> (vendor=0x8086, dev=0x1461) at 30.0
pcib3: <PCI to PCI bridge (vendor=8086 device=1460)> at device 31.0 on pci2
IOAPIC #1 intpin 6 -> irq 20
IOAPIC #1 intpin 7 -> irq 21
pci3: <PCI bus> on pcib3
em0: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port 0x2040-0x207f mem 0xfe6c0000-0xfe6dffff irq 20 at device 7.0 on pci3
em0: Speed:N/A Duplex:N/A
em1: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port 0x2000-0x203f mem 0xfe6e0000-0xfe6fffff irq 21 at device 7.1 on pci3
em1: Speed:N/A Duplex:N/A
pcib4: <PCI to PCI bridge (vendor=1014 device=01a7)> at device 9.0 on pci3
IOAPIC #1 intpin 3 -> irq 22
pci4: <PCI bus> on pcib4
amr0: <LSILogic MegaRAID> mem 0xfe580000-0xfe5fffff,0xfbef0000-0xfbefffff irq 22 at device 0.0 on pci4
amr0: <LSILogic Intel(R) RAID Controller SRCU42X> Firmware 413Y, BIOS H420, 128MB RAM
pci0: <unknown card> (vendor=0x8086, dev=0x2546) at 3.1
uhci0: <Intel 82801CA/CAM (ICH3) USB controller USB-A> port 0x5020-0x503f irq 2 at device 29.0 on pci0
usb0: <Intel 82801CA/CAM (ICH3) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801CA/CAM (ICH3) USB controller USB-B> port 0x5000-0x501f irq 16 at device 29.1 on pci0
usb1: <Intel 82801CA/CAM (ICH3) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
pcib5: <Intel 82801BA/BAM (ICH2) Hub to PCI bridge> at device 30.0 on pci0
pci1: <PCI bus> on pcib5
pci1: <ATI Mach64-GR graphics accelerator> at 12.0 irq 17
isab0: <PCI to ISA bridge (vendor=8086 device=2480)> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH3 ATA100 controller> port 0x3a0-0x3af,0-0x3,0-0x7,0-0x3,0-0x7 irq 0 at device 31.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
pci0: <unknown card> (vendor=0x8086, dev=0x2483) at 31.3 irq 17
orm0: <Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc8fff,0xc9000-0xc9fff on isa0
pmtimer0 on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model Generic PS/2 mouse, device ID 0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
APIC_IO: Testing 8254 interrupt delivery
APIC_IO: routing 8254 via IOAPIC #0 intpin 2
SMP: AP CPU #2 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #3 Launched!
acd0: CDROM <SAMSUNG CD-ROM SN-124> at ata1-master PIO4
Waiting 15 seconds for SCSI devices to settle
amrd0: <LSILogic MegaRAID logical drive> on amr0
amrd0: 140012MB (286744576 sectors) RAID 5 (optimal)
pass0 at amr0 bus 0 target 6 lun 0
pass0: <ESG-SHV SCA HSBP M22 0.06> Fixed Processor SCSI-2 device
Mounting root from ufs:/dev/amrd0s1a

Regards,

Tony.

--
Tony Byrne

_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"