Hi,

we have about 60 Opteron servers running sarge amd64 with a vanilla 2.6.20.3 kernel (previously they were running 2.6.18.1; I installed the newer kernel in the hope that the problem would disappear).
At a rate of about 1-2 per week the SATA disk in a server freezes and I have to reboot. Because of the statistical nature of the effect (and since smartctl -a doesn't display any errors) I conclude that this is not a hardware issue but a software problem.

Any idea how to narrow down the problem?

Thanks,
Thomas

-------------------------------------------------------------------------

Disk: Western Digital SATA 250GB
Device Model: WDC WD2500YS-01SHB0
Firmware Version: 20.06C03

The board is a TYAN S3993 Thunder h2000M with ServerWorks BCM5780 (HT2000) chipset.

Here is a typical error log:

Mar 29 09:07:24 ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen
Mar 29 09:07:24 ata1.00: (BMDMA stat 0x61)
Mar 29 09:07:24 ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
Mar 29 09:07:31 ata1: port is slow to respond, please be patient
Mar 29 09:07:54 ata1: port failed to respond (30 secs)
Mar 29 09:07:54 ata1: soft resetting port
Mar 29 09:08:01 ata1: port is slow to respond, please be patient
Mar 29 09:08:24 ata1: port failed to respond (30 secs)
Mar 29 09:08:25 ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 29 09:08:25 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:25 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:26 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:26 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:26 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:26 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
Mar 29 09:08:54 ata1.00: qc timeout (cmd 0xec)
Mar 29 09:08:54 ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Mar 29 09:08:54 ata1.00: revalidation failed (errno=-5)
Mar 29 09:08:55 ata1: failed to recover some devices, retrying in 5 secs
Mar 29 09:08:59 ata1: hard resetting port
Mar 29 09:09:07 ata1: port is slow to respond, please be patient
Mar 29 09:09:30 ata1: port failed to respond (30 secs)
Mar 29 09:09:31 ata1: COMRESET failed (device not ready)
Mar 29 09:09:31 ata1: hardreset failed, retrying in 5 secs
Mar 29 09:09:35 ata1: hard resetting port
Mar 29 09:09:42 ata1: port is slow to respond, please be patient
Mar 29 09:10:05 ata1: port failed to respond (30 secs)
Mar 29 09:10:05 ata1: COMRESET failed (device not ready)
Mar 29 09:10:05 ata1: hardreset failed, retrying in 5 secs
Mar 29 09:10:10 ata1: hard resetting port
Mar 29 09:10:18 ata1: port is slow to respond, please be patient
Mar 29 09:10:41 ata1: port failed to respond (30 secs)
Mar 29 09:10:41 ata1: COMRESET failed (device not ready)
Mar 29 09:10:41 ata1: reset failed, giving up
Mar 29 09:10:41 ata1.00: disabled
Mar 29 09:10:41 ata1: EH complete
Mar 29 09:10:41 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:41 end_request: I/O error, dev sda, sector 8059914
Mar 29 09:10:41 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:41 end_request: I/O error, dev sda, sector 14590970
Mar 29 09:10:41 Buffer I/O error on device sda2, logical block 823825
Mar 29 09:10:41 lost page write due to I/O error on sda2
Mar 29 09:10:41 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:41 end_request: I/O error, dev sda, sector 14489178
Mar 29 09:10:41 Buffer I/O error on device sda2, logical block 811101
Mar 29 09:10:41 lost page write due to I/O error on sda2
Mar 29 09:10:41 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:41 end_request: I/O error, dev sda, sector 14489762
Mar 29 09:10:41 Buffer I/O error on device sda2, logical block 811174
Mar 29 09:10:41 lost page write due to I/O error on sda2
Mar 29 09:10:41 Buffer I/O error on device sda2, logical block 811175
Mar 29 09:10:42 lost page write due to I/O error on sda2
Mar 29 09:10:42 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:42 end_request: I/O error, dev sda, sector 14488442
Mar 29 09:10:42 Buffer I/O error on device sda2, logical block 811009
Mar 29 09:10:42 lost page write due to I/O error on sda2
Mar 29 09:10:42 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:42 end_request: I/O error, dev sda, sector 14561994
Mar 29 09:10:42 Buffer I/O error on device sda2, logical block 820203
Mar 29 09:10:42 lost page write due to I/O error on sda2
Mar 29 09:10:42 sd 0:0:0:0: SCSI error: return code = 0x00040000
Mar 29 09:10:42 end_request: I/O error, dev sda, sector 8060026
Mar 29 09:10:42 Buffer I/O error on device sda2, logical block 7457
Mar 29 09:10:42 lost page write due to I/O error on sda2
Mar 29 09:10:42 Aborting journal on device sda2.
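
With about 60 machines, one possible first step in narrowing this down is to tally which command timed out on which port across all collected logs, and to decode the status bytes that appear in the messages. The following is only a minimal sketch under assumptions: it expects log lines in the same format as the excerpt above, and the names `decode_status`, `summarize`, `TAG_RE` and `QC_RE` are hypothetical helpers invented for illustration, not part of any existing tool. The status bit names and the three opcodes are taken from the standard ATA command set.

```python
import re
from collections import Counter

# ATA status register bits (standard task-file layout).
# Note: while BSY is set, the other bits are not meaningful.
STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x10, "DSC"),
    (0x08, "DRQ"),
    (0x01, "ERR"),
]

# Only the few opcodes relevant to the excerpt; anything else is
# reported as a raw hex value.
OPCODES = {0xC8: "READ DMA", 0xCA: "WRITE DMA", 0xEC: "IDENTIFY DEVICE"}

# Hypothetical patterns, assuming the message format shown in the log above.
TAG_RE = re.compile(r"(ata\d+(?:\.\d+)?): tag \d+ cmd (0x[0-9a-fA-F]+) .*\(timeout\)")
QC_RE = re.compile(r"(ata\d+(?:\.\d+)?): qc timeout \(cmd (0x[0-9a-fA-F]+)\)")


def decode_status(value):
    """Return the names of the status bits set in an ATA status byte."""
    return [name for mask, name in STATUS_BITS if value & mask]


def summarize(lines):
    """Count timed-out commands per (port or device, command name)."""
    counts = Counter()
    for line in lines:
        match = TAG_RE.search(line) or QC_RE.search(line)
        if match:
            opcode = int(match.group(2), 16)
            counts[(match.group(1), OPCODES.get(opcode, hex(opcode)))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): python summarize_ata.py < /var/log/kern.log
    for (port, command), n in sorted(summarize(sys.stdin).items()):
        print(f"{port:10s} {command:18s} {n}")
    print("0xD0 =", decode_status(0xD0))
```

Applied to the excerpt, this would report the initial timeout as a WRITE DMA (0xca) and the later one as the IDENTIFY DEVICE (0xec) issued during revalidation. The repeated "abnormal status 0xD0" has BSY set, i.e. the drive still reports busy after the reset. Comparing such tallies across hosts, kernels and firmware versions may show whether the first timed-out command is always the same.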