I've suffered from a similar problem:

Short version:
The "pci=nomsi" kernel option resolved my problem - stable for 4 days now.

Long version:
Setup: Ubuntu 9.10 with the 2.6.31-21-generic-pae kernel, md RAID 1, SATA controller in AHCI mode.
I often got errors such as:
[97203.222589] ata4.00: exception Emask 0x0 SAct 0x1ffff SErr 0x0 action 0x6 frozen
[97203.222603] ata4.00: cmd 61/80:00:f3:b8:4a/01:00:39:00:00/40 tag 0 ncq 196608 out
[97203.222606]          res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[97203.222611] ata4.00: status: { DRDY }
...
[97203.222852] ata4: hard resetting link
[97206.170332] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[97206.172770] ata4.00: configured for UDMA/133
[97206.172777] ata4.00: device reported invalid CHS sector 0
...
[97206.172858] ata4: EH complete
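
For anyone comparing their own logs against the ones above, here is a minimal sketch that pulls the fields out of a libata "exception" line. It assumes the line has been un-wrapped onto one line and that SAct is the bitmask of NCQ tags still outstanding when the port froze. The names EXCEPTION_RE and parse_exception are my own, not anything from the kernel.

```python
import re

# Matches the "exception" line format seen in the logs above
# (assumption: the line is not wrapped by the mail client).
EXCEPTION_RE = re.compile(
    r"(?P<dev>ata\d+(?:\.\d+)?): exception Emask (?P<emask>0x[0-9a-f]+) "
    r"SAct (?P<sact>0x[0-9a-f]+) SErr (?P<serr>0x[0-9a-f]+) "
    r"action (?P<action>0x[0-9a-f]+)(?P<frozen> frozen)?"
)


def parse_exception(line):
    """Return the fields of a libata exception line as a dict, or None."""
    m = EXCEPTION_RE.search(line)
    if m is None:
        return None
    sact = int(m.group("sact"), 16)
    return {
        "device": m.group("dev"),
        "emask": int(m.group("emask"), 16),
        "sact": sact,
        # One bit per NCQ tag; a set bit means that tag was still in flight.
        "active_tags": [t for t in range(32) if (sact >> t) & 1],
        "serr": int(m.group("serr"), 16),
        "action": int(m.group("action"), 16),
        "frozen": m.group("frozen") is not None,
    }
```

Feeding it the first exception line above gives 17 active tags (SAct 0x1ffff), while the later exception with SAct 0x0 gives none, which fits that one being a non-queued command.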

After some time my RAID 1 went out of sync - always /dev/sdb2, a partition on a
Seagate drive. /dev/sda is a Samsung.
For example:
[112124.582332] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[112124.582345] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[112124.582347]          res 40/00:08:d3:72:29/00:00:34:00:00/40 Emask 0x4 (timeout)
[112124.582352] ata4.00: status: { DRDY }
[112124.582359] ata4: hard resetting link
[112125.069041] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[112125.071381] ata4.00: configured for UDMA/133
[112125.071395] ata4: EH complete
[112125.079713] end_request: I/O error, dev sdb, sector 976767859
[112125.079720] md: super_written gets error=-5, uptodate=0
[112125.079725] raid1: Disk failure on sdb2, disabling device.
[112125.079726] raid1: Operation continuing on 1 devices.

I have tested the drives with smartctl several times and no errors were
reported.

https://ata.wiki.kernel.org/index.php/Libata_error_messages#Error_classes 
mentioned the following:
timeout
Controller failed to respond to an active ATA command. This could be any number 
of causes. Most often this is due to an unrelated interrupt subsystem bug (try 
booting with 'pci=nomsi' or 'acpi=off' or 'noapic'), which failed to deliver an 
interrupt when we were expecting one from the hardware. 

I tried all 3. The timeout error still occurred with each of them (less
frequently than before), but with "pci=nomsi" the RAID stays up. With the
other 2 options, the error I saw when the RAID failed was:
[58902.450853] sd 3:0:0:0: [sdb] Unhandled error code
[58902.450859] sd 3:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[58902.450865] end_request: I/O error, dev sdb, sector 899754683
[58902.450871] raid1: Disk failure on sdb2, disabling device.
[58902.450873] raid1: Operation continuing on 1 devices.
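
When trying boot options like these, it is worth confirming after the reboot that the option really took effect. A rough sketch, assuming the running kernel's command line is in /proc/cmdline and that the AHCI controller's lines in /proc/interrupts contain "MSI" when message-signalled interrupts are in use (the exact label differs between kernel versions). The function names are hypothetical helpers of mine.

```python
def has_boot_option(cmdline, option):
    """True if `option` appears as a whole word on the kernel command line."""
    return option in cmdline.split()


def ahci_interrupt_lines(interrupts_text):
    """Return the /proc/interrupts lines that mention the ahci driver."""
    return [line for line in interrupts_text.splitlines() if "ahci" in line]


def uses_msi(interrupt_line):
    """Heuristic: MSI interrupts carry 'MSI' in the interrupt type column."""
    return "MSI" in interrupt_line


def report():
    """Print whether pci=nomsi is active and how ahci interrupts are delivered."""
    with open("/proc/cmdline") as f:
        print("pci=nomsi on command line:", has_boot_option(f.read(), "pci=nomsi"))
    with open("/proc/interrupts") as f:
        for line in ahci_interrupt_lines(f.read()):
            print("MSI" if uses_msi(line) else "non-MSI", "->", line.strip())
```

Calling report() on the affected machine should show the ahci lines as non-MSI once "pci=nomsi" is in effect. If they still show MSI, the option probably did not reach the kernel.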

When raid failed originally I did the following:
mdadm --remove /dev/md1 /dev/sdb2
mdadm /dev/md0 --add /dev/sdb1
It synced and then worked for some time - about 2 days before failing again.
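
Since the array kept dropping a member without my noticing right away, a small check of /proc/mdstat can help. This is only a sketch: it assumes the usual mdstat layout, where each array has a status line with "[n/m]" (n devices expected, m in use) followed by a "[UU]"-style map in which "_" marks a missing member. degraded_arrays is a name I made up.

```python
import re

# "md1 : active raid1 ..." starts the entry for an array.
ARRAY_RE = re.compile(r"^(md\w+) : ")
# "[2/1] [U_]" on the following line gives expected/in-use counts and the map.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Map array name -> member map (e.g. 'U_') for arrays missing a member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            current = m.group(1)
            continue
        s = STATUS_RE.search(line)
        if s and current is not None:
            expected, in_use = int(s.group(1)), int(s.group(2))
            if in_use < expected:
                degraded[current] = s.group(3)
            current = None  # status line consumed for this array
    return degraded


def check():
    """Read /proc/mdstat and print any degraded arrays."""
    with open("/proc/mdstat") as f:
        for name, members in degraded_arrays(f.read()).items():
            print(name, "is degraded:", members)
```

Running check() from cron would have told me about the sdb2 failures sooner than stumbling over them in dmesg.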

It does not seem to be load related - failures happened in the middle of
the night at times.

Hope this helps - I'm about to upgrade to 10.04.

-- 
Filling disk with data leads to [sda] Unhandled error code. [sda] Result 
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
https://bugs.launchpad.net/bugs/577796