Alexander Motin wrote on 03.03.2010 09:18 (localtime):
> Harald Schmalzbauer wrote:
>> Alexander Motin wrote on 23.02.2010 16:10 (localtime):
>>> Harald Schmalzbauer wrote:
>>>> I'm frequently getting my machine locked with ahcichX timeouts:
>>>> ahcich2: Timeout on slot 0
>>>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr 00000000
>>>> ahcich2: Timeout on slot 8
>>>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000
>>>> ahcich2: Timeout on slot 8
>>>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr 00000000
>>>> ...
>>>
>>> Given that is (interrupt status) is zero and `rs == cs | ss` (running
>>> command bitmasks in driver and hardware), the controller doesn't report
>>> command completion. Looking at the TFD status 0xc0 with the BUSY bit set,
>>> I would suppose that either the disk is stuck in command processing for
>>> some reason, or the controller missed the command completion status.
>>>
>>> Have you noticed a 30 second (default ATA timeout) pause before the
>>> timeout message is printed? Just want to be sure that the driver waited
>>> long enough before giving up.
>>>
>>>> This happens when backup over GbE overloads ZFS/HDD capabilities. I
>>>> reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking up
>>>> almost immediately, but it still happens. When I don't use ahci but
>>>> ataahci (the old driver, if I understand things correctly) I also see
>>>> the ZFS burst write congestion, but this doesn't lead to controller
>>>> timeouts, thus blocking the machine. Sometimes the machine recovers
>>>> from the disk lock, but most often I have to reboot.
>>>
>>> How does it look when it doesn't? Can you send me the full log messages?
>>
>> Hello, this morning I had a stall, but the machine recovered after about
>> one minute.
>> Here's what I got from the kernel:
>> ahcich2: Timeout on slot 29
>> ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr 00000000
>> em1: watchdog timeout -- resetting
>> em1: watchdog timeout -- resetting
>> ahcich2: Timeout on slot 10
>> ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 serr 00000000
>> ahcich2: Timeout on slot 18
>> ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr 00000000
>> ahcich2: Timeout on slot 2
>> ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr 00000000
>> ahcich2: Timeout on slot 2
>> ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 serr 00000000
>>
>> Does this tell you something useful?
>
> It doesn't. Looking at the logged register content, commands are indeed
> still running and no interrupts are requested. Interesting to see the em1
> watchdog timeout there. Aren't they related somehow?
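
For anyone who wants to check the register dumps quoted above against Alexander's reading, here is a minimal Python sketch that decodes one such line. It is an illustration only, not part of the driver: the names `decode` and `slots` are made up for this example, and treating cs and ss as the hardware's command bitmasks and rs as the driver's view of running slots is an assumption based on the "running command bitmasks in driver and hardware" remark. The BSY, DRDY, DRQ and ERR values are the standard ATA status register bits.

```python
import re

# Standard ATA status register bits, as seen in the low byte of tfd.
ATA_BSY = 0x80   # device busy
ATA_DRDY = 0x40  # device ready
ATA_DRQ = 0x08   # data request
ATA_ERR = 0x01   # error

# Matches the register part of a line such as:
#   ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr 00000000
LINE_RE = re.compile(
    r"is (?P<intr>[0-9a-f]+) cs (?P<cs>[0-9a-f]+) ss (?P<ss>[0-9a-f]+) "
    r"rs (?P<rs>[0-9a-f]+) tfd (?P<tfd>[0-9a-f]+) serr (?P<serr>[0-9a-f]+)"
)


def slots(mask):
    """Return the slot numbers whose bits are set in a 32-bit mask."""
    return [n for n in range(32) if mask & (1 << n)]


def decode(line):
    """Decode one register dump line into a dict (hypothetical helper)."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a register dump line: %r" % line)
    reg = {name: int(value, 16) for name, value in m.groupdict().items()}
    return {
        # Non-zero interrupt status would mean the controller signalled something.
        "interrupt_pending": reg["intr"] != 0,
        # Assumption: rs is the driver's view, cs | ss the hardware's view.
        "driver_matches_hw": reg["rs"] == (reg["cs"] | reg["ss"]),
        "running_slots": slots(reg["rs"]),
        "busy": bool(reg["tfd"] & ATA_BSY),
        "ready": bool(reg["tfd"] & ATA_DRDY),
        "data_request": bool(reg["tfd"] & ATA_DRQ),
        "error": bool(reg["tfd"] & ATA_ERR),
        "serror": reg["serr"],
    }


if __name__ == "__main__":
    sample = ("ahcich2: is 00000000 cs 00000003 ss e0000003 "
              "rs e0000003 tfd c0 serr 00000000")
    for key, value in decode(sample).items():
        print(key, value)
```

Run against the dumps above, every line with tfd c0 decodes as busy with no interrupt pending, while the last line (tfd 40) shows the busy bit cleared with slots 2 and 3 still marked as running.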
I have the drives now running in another server with an ICH7 chipset. Using UFS, the complete machine locks up for ~30 secs with a disk load of 3.5MB/s. But I don't get any timeout messages, and the machine has always recovered.

Changing to the old ata driver solves the problem.

Any chance to get this problem fixed? I couldn't see lockups on another OS with NCQ in AHCI mode enabled. I'd ship such a disk to anyone who is willing to debug.

Thanks,

-Harry