Harald Schmalzbauer schrieb am 13.03.2010 22:27 (localtime):
Am 03.03.2010 12:06, schrieb Jeremy Chadwick:On Wed, Mar 03, 2010 at 09:28:25AM +0100, Harald Schmalzbauer wrote:Alexander Motin schrieb am 03.03.2010 09:18 (localtime):Harald Schmalzbauer wrote:Alexander Motin schrieb am 23.02.2010 16:10 (localtime):Harald Schmalzbauer wrote:I'm frequently getting my machine locked with ahcichX timeouts: ahcich2: Timeout on slot 0 ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr 00000000 ahcich2: Timeout on slot 8 ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000 ahcich2: Timeout on slot 8 ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr 00000000 ...Looking that is (Interrupt status) is zero and `rs == cs | ss` (running command bitmasks in driver and hardware), controller doesn't report command completion. Looking on TFD status 0xc0 with BUSY bit set, I would suppose that either disk stuck in command processing for some reason, or controller missed command completion status.Have you noticed 30 second (default ATA timeout) pause before timeout message printed? Just want to be sure that driver waited enough before give up.This happens when backup over GbE overloads ZFS/HDD capabilities. I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking up almost immediately, but from it still happens. When I don't use ahci but ataahci (the old driver if I understand things correct) I also see the ZFS burst write congestion, but this doesn't lead to controller timeouts, thus blocking the machine. Sometimes the machine recovers from the disk lock, but most often I have to reboot.How it looks when it doesn't? Can you send me full log messages?Hello, this morning I had a stall, but the machine recovered after about one Minute. Here's what I got from the kernel: ahcich2: Timeout on slot 29 ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr 00000000 em1: watchdog timeout -- resetting em1: watchdog timeout -- resetting ahcich2: Timeout on slot 10 ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 serr 00000000 ahcich2: Timeout on slot 18 ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr 00000000 ahcich2: Timeout on slot 2 ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr 00000000 ahcich2: Timeout on slot 2 ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 serr 00000000 Does this tell you something useful?It doesn't. Looking on logged register content - commands are indeed still running and no interrupts requested. Interesting to see em1 watchdog timeout there. Aren't they related somehow?dmesg | grep "irq 18": uhci0: <Intel 82801I (ICH9) USB controller> port 0x20c0-0x20df irq 18 at device 26.0 on pci0 uhci4: <Intel 82801I (ICH9) USB controller> port 0x2040-0x205f irq 18 at device 29.2 on pci0 em1: <Intel(R) PRO/1000 Network Connection 6.9.14> port 0x1000-0x103f mem 0xe1920000-0xe193ffff,0xe1900000-0xe191ffff irq 18 at device 2.0 on pci3 ichsmb0: <Intel 82801I (ICH9) SMBus controller> port 0x2000-0x201f mem 0xe1a22000-0xe1a220ff irq 18 at device 31.3 on pci0 The don't share the same IRQ at least. dmesg | grep "irq 21" uhci1: <Intel 82801I (ICH9) USB controller> port 0x20a0-0x20bf irq 21 at device 26.1 on pci0 ahci0: <Intel ICH9 AHCI SATA controller> port 0x2408-0x240f,0x2414-0x2417,0x2400-0x2407,0x2410-0x2413,0x2020-0x203f mem 0xe1a21000-0xe1a217ff irq 21 at device 31.2 on pci0 The em1 has no cable attached. I get many of these em watchdog timeouts. Never thought they could be related to ahci. I'll see if the em watchdog timeouts happens in any relation to disk usage.Please provide output from the commands I provided. dmesg|grep is not sufficient for helping track this down, specifically with regards to the em1 watchdog timeouts.Sorry for the delay, here's the details: hos...@pci0:0:0:0: class=0x060000 card=0x34d08086 chip=0x29f08086 rev=0x00 hdr=0x00 vendor = 'Intel Corporation' device = '3200 Chipset (Bearlake) Processor to I/O Controller' class = bridge subclass = HOST-PCI e...@pci0:0:25:0: class=0x020000 card=0x34d08086 chip=0x10bd8086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = 'Intel 82566DM Gigabit Ethernet Adapter (82566DM)' class = network subclass = ethernet uh...@pci0:0:26:0: class=0x0c0300 card=0x34d08086 chip=0x29378086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) USB Universal Host Controller' class = serial bus subclass = USB uh...@pci0:0:26:1: class=0x0c0300 card=0x34d08086 chip=0x29388086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) USB Universal Host Controller' class = serial bus subclass = USB eh...@pci0:0:26:7: class=0x0c0320 card=0x34d08086 chip=0x293c8086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) USB2 Enhanced Host Controller' class = serial bus subclass = USB pc...@pci0:0:28:0: class=0x060400 card=0x29408086 chip=0x29408086 rev=0x02 hdr=0x01 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) PCIe Root Port 1' class = bridge subclass = PCI-PCI pc...@pci0:0:28:4: class=0x060400 card=0x29488086 chip=0x29488086 rev=0x02 hdr=0x01 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) PCIe Root Port 5' class = bridge subclass = PCI-PCI uh...@pci0:0:29:0: class=0x0c0300 card=0x34d08086 chip=0x29348086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) USB Universal Host Controller' class = serial bus subclass = USB uh...@pci0:0:29:1: class=0x0c0300 card=0x34d08086 chip=0x29358086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) USB Universal Host Controller' class = serial bus subclass = USB uh...@pci0:0:29:2: class=0x0c0300 card=0x34d08086 chip=0x29368086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) USB Universal Host Controller' class = serial bus subclass = USB eh...@pci0:0:29:7: class=0x0c0320 card=0x34d08086 chip=0x293a8086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) USB2 Enhanced Host Controller' class = serial bus subclass = USB pc...@pci0:0:30:0: class=0x060401 card=0x34d08086 chip=0x244e8086 rev=0x92 hdr=0x01 vendor = 'Intel Corporation' device = '82801 Family (ICH2/3/4/5/6/7/8/9,63xxESB) Hub Interface to PCI Bridge' class = bridge subclass = PCI-PCI is...@pci0:0:31:0: class=0x060100 card=0x34d08086 chip=0x29168086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = '82801IR (ICH9R) LPC Interface Controller' class = bridge subclass = PCI-ISA ah...@pci0:0:31:2: class=0x010601 card=0x34d08086 chip=0x29228086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) 6 port SATA AHCI Controller' class = mass storage subclass = SATA ichs...@pci0:0:31:3: class=0x0c0500 card=0x34d08086 chip=0x29308086 rev=0x02 hdr=0x00 vendor = 'Intel Corporation' device = '82801IB/IR/IH (ICH9 Family) SMBus Controller' class = serial bus subclass = SMBus vgap...@pci0:2:0:0: class=0x030000 card=0x01018086 chip=0x0522102b rev=0x02 hdr=0x00 vendor = 'Matrox Electronic Systems Ltd.' device = 'Matrox G200e (ServerEngines) - English (G200e)' class = display subclass = VGA e...@pci0:3:2:0: class=0x020000 card=0x34d08086 chip=0x10768086 rev=0x05 hdr=0x00 vendor = 'Intel Corporation' device = 'Gigabit Ethernet Controller (82541EI)' class = network subclass = ethernet banana:~>20: vmstat -i interrupt total rate irq4: uart0 27392 0 irq18: em1 uhci0++ 178 0 irq19: uhci3 22 0 irq23: uhci2 ehci1 2 0 cpu0: timer 231629257 1917 irq256: em0 2559286 21 irq257: ahci0 1602196 13 cpu1: timer 231628629 1917 Total 467446962 3869 Looks very similar to the problem here: http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2010-03/msg00143.html
For the records, again I have those timeouts, but with different status bits: em1: watchdog timeout -- resetting em1: watchdog timeout -- resetting em1: watchdog timeout -- resetting em1: watchdog timeout -- resetting ahcich2: Timeout on slot 14ahcich2: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd c0 serr 00000000
ahcich2: Timeout on slot 14ahcich2: is 00000000 cs fff81fff ss ffffdfff rs ffffdfff tfd c0 serr 00000000
ahcich2: Timeout on slot 10ahcich2: is 00000000 cs 00000c00 ss 00000000 rs 00000c00 tfd c0 serr 00000000
ahcich2: Timeout on slot 6ahcich2: is 00000000 cs 00000040 ss 00000000 rs 00000040 tfd c0 serr 00000000
ahcich2: Timeout on slot 2ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr 00000000
ahcich2: Timeout on slot 30ahcich2: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd c0 serr 00000000
ahcich2: Timeout on slot 30ahcich2: is 00000000 cs 1ffffffc ss dfffffff rs dfffffff tfd c0 serr 00000000
em1: watchdog timeout -- resetting em1: watchdog timeout -- resettingThe machine is now hangig for several hours, no login at resial console possible. To emhazise, em1 is not connected/used, em0 is the productive interface. I'll have to drive on site... What do these bits tell compared to the hangs above, especcially the middle one, which the machine recovered from? No interrupts requested and `rs == cs | ss` meaning controller doesn't report command completion. That's identical. Any other useful info? If not I won't post future timeout messages.
Thanks, -Harry
signature.asc
Description: OpenPGP digital signature