Hi, when scrubbing an md array yesterday (do not know who triggered that yet but that's something I hope to cover in another thread...) I got lots of these:
Dec 2 01:38:45 xerxes kernel: [4942064.241510] ata6: hard resetting link Dec 2 01:38:45 xerxes kernel: [4942064.731991] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 2 01:38:45 xerxes kernel: [4942064.743759] ata6.00: configured for UDMA/33 Dec 2 01:38:45 xerxes kernel: [4942064.755973] ata6: EH complete Dec 2 01:39:11 xerxes kernel: [4942090.846069] ata6: hard resetting link Dec 2 01:39:12 xerxes kernel: [4942091.334346] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 2 01:39:12 xerxes kernel: [4942091.346117] ata6.00: configured for UDMA/33 Dec 2 01:39:12 xerxes kernel: [4942091.358342] ata6: EH complete Dec 2 01:39:37 xerxes kernel: [4942116.461263] ata6: hard resetting link Dec 2 01:39:37 xerxes kernel: [4942116.949209] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 2 01:39:37 xerxes kernel: [4942116.961046] ata6.00: configured for UDMA/33 Dec 2 01:39:37 xerxes kernel: [4942116.973225] ata6: EH complete Dec 2 01:39:45 xerxes kernel: [4942124.317211] ata6: hard resetting link Dec 2 01:39:45 xerxes kernel: [4942124.805185] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 2 01:39:45 xerxes kernel: [4942124.816985] ata6.00: configured for UDMA/33 Dec 2 01:39:45 xerxes kernel: [4942124.829171] ata6: EH complete Dec 2 01:39:50 xerxes kernel: [4942129.661310] ata6: hard resetting link Dec 2 01:39:51 xerxes kernel: [4942130.150457] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 2 01:39:51 xerxes kernel: [4942130.162256] ata6.00: configured for UDMA/33 Dec 2 01:39:51 xerxes kernel: [4942130.174443] ata6: EH complete Dec 2 01:39:55 xerxes kernel: [4942134.652582] ata6: hard resetting link etc etc. Kept repeating until the check was over, roughly every 5-30 seconds. The array check completed fine, and when I force-fed the array some test data I saw no trace of link resets in messages. I rather suspect the chipset or cabling to be the flaky part, when the system was all fresh I had quite a few issues of that kind until I pimped the chipset cooling. Cables are standard sATA cables like they come with any motherboard, 50cm in length. Then again that particular disk did log some errors in the SMART journal: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 4536 2 Throughput_Performance 0x0026 053 053 000 Old_age Always - 17969 3 Spin_Up_Time 0x0023 061 060 025 Pre-fail Always - 11994 4 Start_Stop_Count 0x0032 094 094 000 Old_age Always - 6880 5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0 8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 9953 10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 470 191 G-Sense_Error_Rate 0x0022 091 091 000 Old_age Always - 93015 192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0 194 Temperature_Celsius 0x0002 064 053 000 Old_age Always - 26 (Min/Max 15/47) 195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0 196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0036 099 099 000 Old_age Always - 760 200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 307 223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0 225 Load_Cycle_Count 0x0032 097 097 000 Old_age Always - 38029 SMART Error Log Version: 1 ATA Error Count: 5 CR = Command Register [HEX] FR = Features Register [HEX] SC = Sector Count Register [HEX] SN = Sector Number Register [HEX] CL = Cylinder Low Register [HEX] CH = Cylinder High Register [HEX] DH = Device/Head Register [HEX] DC = Device Command Register [HEX] ER = Error register [HEX] ST = Status register [HEX] Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days. Error 5 occurred at disk power-on lifetime: 2804 hours (116 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 0a 00:01:24.159 IDENTIFY DEVICE 27 00 00 00 00 00 e0 0a 00:01:24.159 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 a0 0a 00:01:24.159 IDENTIFY DEVICE 00 00 00 00 00 00 00 0a 00:01:24.159 NOP [Abort queued commands] 00 00 30 bf fc 3e 40 0a 00:01:24.158 NOP [Abort queued commands] Error 4 occurred at disk power-on lifetime: 2804 hours (116 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 0a 00:01:24.159 IDENTIFY DEVICE ec 00 00 00 00 00 a0 0a 00:01:24.159 IDENTIFY DEVICE 00 00 00 00 00 00 00 0a 00:01:24.159 NOP [Abort queued commands] 00 00 30 bf fc 3e 40 0a 00:01:24.158 NOP [Abort queued commands] 60 00 70 4f 94 7e 40 0a 00:01:22.644 READ FPDMA QUEUED Error 3 occurred at disk power-on lifetime: 2658 hours (110 days + 18 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 0a 00:00:08.945 IDENTIFY DEVICE ec 00 00 00 00 00 a0 0a 00:00:08.945 IDENTIFY DEVICE 00 00 01 01 00 00 40 0a 00:00:08.945 NOP [Abort queued commands] 00 00 01 01 00 00 40 00 00:00:08.945 NOP [Abort queued commands] 25 00 08 3f 00 30 e0 0a 00:00:08.944 READ DMA EXT Error 2 occurred at disk power-on lifetime: 2356 hours (98 days + 4 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 0a 00:00:27.230 IDENTIFY DEVICE ec 00 00 00 00 00 a0 0a 00:00:27.230 IDENTIFY DEVICE 00 08 00 00 00 00 00 08 00:00:27.230 NOP [Reserved subcommand] 00 08 0c d7 37 ba e0 08 00:00:27.229 NOP [Reserved subcommand] 60 08 10 2f 74 eb 40 08 00:00:27.161 READ FPDMA QUEUED Error 1 occurred at disk power-on lifetime: 2099 hours (87 days + 11 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 0a 00:00:04.659 IDENTIFY DEVICE ec 00 00 00 00 00 a0 0a 00:00:04.659 IDENTIFY DEVICE ef 03 46 00 00 00 a0 0a 00:00:04.659 SET FEATURES [Set transfer mode] 27 00 00 00 00 00 e0 0a 00:00:04.659 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 a0 0a 00:00:04.659 IDENTIFY DEVICE SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Extended offline Completed without error 00% 8084 - # 2 Short offline Completed without error 00% 5062 - # 3 Short offline Completed without error 00% 457 - Well - what do you make of this? In normal operations the link resets do not seem to occur at all, even under stress tests. Dex -- -----BEGIN GEEK CODE BLOCK----- Version: 3.12 GCS d--(+)@ s-:+ a C++++ UL++ P+>++ L+++>++++ E-- W++ N o? K- w--(---) !O M+ V- PS+ PE Y++ PGP t++(---)@ 5 X+(++) R+(++) tv--(+)@ b++(+++) DI+++ D- G++ e* h>++ r* y? ------END GEEK CODE BLOCK------
signature.asc
Description: This is a digitally signed message part.