Alexander Motin wrote on 23.02.2010 16:10 (localtime):
> Harald Schmalzbauer wrote:
>> I'm frequently getting my machine locked with ahcichX timeouts:
>> ahcich2: Timeout on slot 0
>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr 00000000
>> ahcich2: Timeout on slot 8
>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000
>> ahcich2: Timeout on slot 8
>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr 00000000
>> ...
>
> Seeing that is (interrupt status) is zero and `rs == cs | ss` (the running-command bitmasks in driver and hardware), the controller doesn't report command completion. Given the TFD status 0xc0 with the BUSY bit set, I would suppose that either the disk got stuck in command processing for some reason, or the controller missed the command completion status. Did you notice a 30-second (default ATA timeout) pause before the timeout message was printed? I just want to be sure that the driver waited long enough before giving up.
Yes, there is some pause between the occurrence of the hang and the first timeout message. But I can't tell you whether it's exactly 30 seconds. My guess is that it's rather more than 30 seconds.
>> This happens when backup over GbE overloads ZFS/HDD capabilities. I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking up almost immediately, but it still happens. When I don't use ahci but ataahci (the old driver, if I understand things correctly), I also see the ZFS burst write congestion, but it doesn't lead to controller timeouts that block the machine. Sometimes the machine recovers from the disk lock, but most often I have to reboot.
>
> How does it look when it doesn't? Can you send me full log messages?
Unfortunately not. That happened only once (that I noticed), 3 days ago, and the messages file has been rotated 5 times since then... But I have some messages from 02/15, with a kernel from January. Usually the messages continue to pop up until I reset the machine. This time there were only the three above, even after waiting half an hour (I had to go on site). The old messages:
ahcich2: Timeout on slot 20
ahcich2: is 00000000 cs ff07ffff ss fff7ffff rs fff7ffff tfd c0 serr 00000000
ahcich4: Timeout on slot 24
ahcich4: is 00000000 cs f07fffff ss ff7fffff rs ff7fffff tfd c0 serr 00000000
ahcich2: Timeout on slot 17
ahcich2: is 00000000 cs fff9ffff ss ffffffff rs ffffffff tfd c0 serr 00000000
ahcich4: Timeout on slot 20
ahcich4: is 00000000 cs 00300000 ss 00000000 rs 00300000 tfd c0 serr 00000000
ahcich2: Timeout on slot 15
ahcich2: is 00000000 cs fff87fff ss ffffffff rs ffffffff tfd c0 serr 00000000
ahcich4: Timeout on slot 22
ahcich4: is 00000000 cs fc0fffff ss ffcfffff rs ffcfffff tfd c0 serr 00000000
ahcich2: Timeout on slot 13
ahcich2: is 00000000 cs ffff1fff ss ffffffff rs ffffffff tfd c0 serr 00000000
ahcich4: Timeout on slot 16
ahcich4: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd c0 serr 00000000
ahcich2: Timeout on slot 11
ahcich2: is 00000000 cs ffffc7ff ss ffffffff rs ffffffff tfd c0 serr 00000000
ahcich4: Timeout on slot 16
ahcich4: is 00000000 cs 00000000 ss 00010000 rs 00010000 tfd 40 serr 00000000
Maybe they're helpful to you. Since I hadn't seen the hang after upgrading, despite doing extensive network transfer tests, I thought it had vanished and didn't keep the logs...
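
The checks Alexander applied to the first dump (is zero, `rs == cs | ss`, BUSY bit set in tfd) can be run mechanically over dump lines like the ones above. What follows is only a minimal Python sketch, not anything the driver provides: the regular expression is inferred from the log lines in this thread, the function name `check_dump` and the result keys are made up for illustration, and BUSY is taken to be bit 7 (0x80) of the ATA status byte.

```python
import re

# Register dump line as it appears in this thread, e.g.
# "ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr 00000000"
# (format inferred from the messages above, not from the driver source)
DUMP_RE = re.compile(
    r"(?P<chan>ahcich\d+): is (?P<intr>[0-9a-f]+) cs (?P<cs>[0-9a-f]+) "
    r"ss (?P<ss>[0-9a-f]+) rs (?P<rs>[0-9a-f]+) tfd (?P<tfd>[0-9a-f]+) "
    r"serr (?P<serr>[0-9a-f]+)"
)

ATA_BUSY = 0x80  # BSY bit of the ATA status byte (assumed: low byte of tfd)


def check_dump(line):
    """Return the individual checks for one dump line, or None if it does not match."""
    m = DUMP_RE.search(line)
    if m is None:
        return None
    regs = {k: int(v, 16) for k, v in m.groupdict().items() if k != "chan"}
    return {
        "channel": m.group("chan"),
        # no interrupt pending: the controller reported no completion
        "no_interrupt": regs["intr"] == 0,
        # driver-side running mask equals the union of the hardware masks
        "masks_consistent": regs["rs"] == (regs["cs"] | regs["ss"]),
        # device still claims to be busy
        "busy": bool(regs["tfd"] & ATA_BUSY),
        "no_sata_error": regs["serr"] == 0,
    }


if __name__ == "__main__":
    sample = (
        "ahcich4: is 00000000 cs 00000000 ss 00010000 "
        "rs 00010000 tfd 40 serr 00000000"
    )
    print(check_dump(sample))
```

Run on the messages above, this would flag the last ahcich4 line as the only one where BUSY is clear (tfd 40 instead of c0).
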
>> Kernel is from Feb. 19, so recent ahci improvements are active. Controller is ICH9R with 3 Samsung F3 SpinPoints. Any ideas how to work around the hangs other than using the old ahci driver?
>
> The old ataahci driver wasn't using NCQ. NCQ may trigger some bugs in drive firmware or expose some protocol inconsistencies. I would recommend that you search for errata for your drive and possibly a firmware update.
Sounds reasonable. How can I disable NCQ with the new ahci? I guess if it's an HDD firmware issue with NCQ, the hang shouldn't happen when NCQ is disabled.

Btw, I found camcontrol cmd ada0 -a "EF 85 00 00 00 00 00 00 00 00 00 00" for disabling APM and another one for disabling AAM. I did that for my drives. Is there a wiki where we can place such valuable commands?
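
For anyone wondering what the byte string in that camcontrol call means: as far as I understand it, the first byte is the ATA command and the second the features register, so EF 85 is SET FEATURES with the "disable APM" subcommand. The small Python sketch below only decodes those two bytes. The helper name `describe` is hypothetical, the byte order is my assumption about the 12-byte layout, and the subcommand table is limited to the APM/AAM codes from the ATA specification.

```python
# ATA SET FEATURES opcode and the APM/AAM subcommand codes (ATA specification)
ATA_SET_FEATURES = 0xEF
SUBCOMMANDS = {
    0x05: "enable APM",
    0x85: "disable APM",
    0x42: "enable AAM",
    0xC2: "disable AAM",
}


def describe(cmd_bytes):
    """Describe a 12-byte command string as passed to 'camcontrol cmd -a'.

    Assumption: byte 0 is the command register and byte 1 the features register.
    """
    raw = [int(b, 16) for b in cmd_bytes.split()]
    if len(raw) != 12:
        raise ValueError("expected 12 hex bytes, got %d" % len(raw))
    command, features = raw[0], raw[1]
    if command != ATA_SET_FEATURES:
        return "not a SET FEATURES command"
    return SUBCOMMANDS.get(features, "unknown SET FEATURES subcommand")


if __name__ == "__main__":
    print(describe("EF 85 00 00 00 00 00 00 00 00 00 00"))
```
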
Thanks, -Harry