I've suffered from a similar problem.

Short version: the "pci=nomsi" kernel option resolved it for me. The system has been stable for 4 days now.
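For anyone who wants to try the same option, here is a minimal sketch. It assumes GRUB 2 with /etc/default/grub; systems upgraded from older releases may still use GRUB legacy and /boot/grub/menu.lst instead. The has_kernel_opt helper is a made-up name for illustration, not a standard tool; it only checks whether the option actually reached the running kernel after the reboot.

```shell
# Sketch only, assuming GRUB 2:
# 1. As root, edit /etc/default/grub and append pci=nomsi to the default
#    options, for example:
#      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nomsi"
# 2. Regenerate the boot configuration, then reboot:
#      sudo update-grub

# has_kernel_opt (hypothetical helper): succeeds if the given option appears
# as a whole word in the command-line file (default: /proc/cmdline).
has_kernel_opt() {
    opt=$1
    file=${2:-/proc/cmdline}
    tr ' ' '\n' < "$file" | grep -qxF -- "$opt"
}

if has_kernel_opt pci=nomsi; then
    echo "pci=nomsi is active"
else
    echo "pci=nomsi is NOT active"
fi
```

For a one-off test, the option can also be added by hand to the kernel line in the boot menu before making it permanent.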
Long version: I'm running Ubuntu 9.10 with the 2.6.31-21-generic-pae kernel and RAID 1, using AHCI. I often got errors such as:

    [97203.222589] ata4.00: exception Emask 0x0 SAct 0x1ffff SErr 0x0 action 0x6 frozen
    [97203.222603] ata4.00: cmd 61/80:00:f3:b8:4a/01:00:39:00:00/40 tag 0 ncq 196608 out
    [97203.222606] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
    [97203.222611] ata4.00: status: { DRDY }
    ...
    [97203.222852] ata4: hard resetting link
    [97206.170332] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [97206.172770] ata4.00: configured for UDMA/133
    [97206.172777] ata4.00: device reported invalid CHS sector 0
    ...
    [97206.172858] ata4: EH complete

After some time my RAID 1 went out of sync. The member that dropped out was always /dev/sdb2, a partition on a Seagate drive; /dev/sda is a Samsung. For example:

    [112124.582332] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    [112124.582345] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
    [112124.582347] res 40/00:08:d3:72:29/00:00:34:00:00/40 Emask 0x4 (timeout)
    [112124.582352] ata4.00: status: { DRDY }
    [112124.582359] ata4: hard resetting link
    [112125.069041] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [112125.071381] ata4.00: configured for UDMA/133
    [112125.071395] ata4: EH complete
    [112125.079713] end_request: I/O error, dev sdb, sector 976767859
    [112125.079720] md: super_written gets error=-5, uptodate=0
    [112125.079725] raid1: Disk failure on sdb2, disabling device.
    [112125.079726] raid1: Operation continuing on 1 devices.

I have tested the drives with smartctl several times and no errors were reported.

https://ata.wiki.kernel.org/index.php/Libata_error_messages#Error_classes says the following about the "timeout" error class:

    timeout: Controller failed to respond to an active ATA command. This could be any number of causes. Most often this is due to an unrelated interrupt subsystem bug (try booting with 'pci=nomsi' or 'acpi=off' or 'noapic'), which failed to deliver an interrupt when we were expecting one from the hardware.
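To compare how often these events occur under each boot option, something like the following can tally them per port. It is a rough sketch: count_ata_exceptions is a hypothetical helper name, and it counts the "exception Emask" lines as a proxy, because the "(timeout)" line itself does not carry the port name.

```shell
# count_ata_exceptions (hypothetical helper): reads kernel log text on stdin
# and prints "<count> <port>" for every ATA port that logged an exception line.
count_ata_exceptions() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: exception Emask' \
        | sed -E 's/.*(ata[0-9]+)(\.[0-9]+)?: exception.*/\1/' \
        | sort | uniq -c | awk '{print $1, $2}'
}

# Typical use (reading the kernel ring buffer may need root on some systems):
#   dmesg | count_ata_exceptions
# or, against a saved log file:
#   count_ata_exceptions < /var/log/kern.log
```

Running it once per boot option, over a comparable uptime, gives a rough basis for the "less frequently" comparison.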
I tried all 3. The timeout error still occurred with each of them (less frequently than before), but with "pci=nomsi" the RAID stays up. The error that I saw with the other 2 options when the RAID failed was:

    [58902.450853] sd 3:0:0:0: [sdb] Unhandled error code
    [58902.450859] sd 3:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
    [58902.450865] end_request: I/O error, dev sdb, sector 899754683
    [58902.450871] raid1: Disk failure on sdb2, disabling device.
    [58902.450873] raid1: Operation continuing on 1 devices.

When the RAID failed originally I did the following:

    mdadm --remove /dev/md1 /dev/sdb2
    mdadm /dev/md0 --add /dev/sdb1

It synced and then worked for some time, about 2 days, before failing again. It does not seem to be load related: failures happened in the middle of the night at times.

Hope this helps. I'm about to upgrade to 10.04.

--
Filling disk with data leads to [sda] Unhandled error code. [sda] Result hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
https://bugs.launchpad.net/bugs/577796
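For anyone re-adding a failed member as described above, a quick way to notice when an array drops a device again is to scan /proc/mdstat. This is a minimal sketch; degraded_arrays is a hypothetical helper name, and it assumes the usual mdstat layout where the status line under each array ends in a field such as [UU] or [U_].

```shell
# degraded_arrays (hypothetical helper): prints the name of every md array
# whose status field (e.g. [U_]) shows a missing member.
# Reads /proc/mdstat unless another file is given as the first argument.
degraded_arrays() {
    awk '/^md[0-9]+ :/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { if (name != "") print name; name = "" }' \
        "${1:-/proc/mdstat}"
}

if [ -r /proc/mdstat ]; then
    degraded_arrays
fi

# Resync progress after a re-add can be followed with, for example:
#   cat /proc/mdstat
#   mdadm --detail /dev/md1    # device name taken from the post above
```

Run from cron, this would show whether failures really cluster at particular times, such as the middle-of-the-night ones mentioned above.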