Hi, Thanks again for the detailed reply!
See the very bottom of my mail. I don't believe the PSU is the problem, after reviewing your SMART statistics.
Ok, I'll stick to the one I have then, for now.
My other (on-board) SATA controller is a VIA controller, and I've never had any problems with it (although the hardware raid messed up once a year or 2 ago, and since then I've been using software raid without any issues).

Okay, so you've got an onboard VIA (VT6410) SATA controller, an onboard VIA IDE controller, and a PCI SATA controller. I'd still like to know which disks are attached to what controller, and if any of the devices are sharing IRQs. Can you provide the output from the following two commands?

  dmesg | egrep 'atapci|(ad|ata)[0-9]+'
  vmstat -i

I'm just trying to narrow stuff down.
All right, attached is the output of both of these commands.
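For readers who want to check for shared IRQs without reading the table by eye, here is a minimal sketch. It assumes the `vmstat -i` layout seen in this thread (an `irqN:` label, one or more device names, then the total and rate columns); the function name `shared_irqs` is made up for illustration, and other FreeBSD releases may print shared interrupts differently.

```shell
# Print every IRQ line from `vmstat -i` output (read on stdin) that lists
# more than one device, which is how a shared interrupt shows up in the
# layout assumed here: "irqN: dev [dev ...] total rate".
shared_irqs() {
    awk '/^irq[0-9]+:/ && NF > 4 {
        # Fields 2 .. NF-2 are device names; the last two are total and rate.
        printf "%s", $1
        for (i = 2; i <= NF - 2; i++) printf " %s", $i
        printf "\n"
    }'
}
```

It would be used as `vmstat -i | shared_irqs`; no output means no line listed more than one device.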
It's interesting that the disks which are giving you trouble are Samsung disks. There's some history here which you should be made aware of:

In July, Daniel Eriksson reported data corruption occurring with his nVidia MCP55 chipset when 1TB Samsung disks were attached to it. The same disks on another controller performed fine. The corruption was being detected by ZFS as checksum errors. (UFS/UFS2 won't detect this sort of thing, unless the corruption is occurring somewhere within the filesystem tables.)

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043427.html

Soren Schmidt (ata(4) author) replied that there are some nVidia chipset-related fixes for ATA in -CURRENT, and provided a patch. Daniel reported that the patch made absolutely no difference:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043434.html

Daniel also tried using a firmware patch for his Samsung disks, which limits the SATA speed to SATA150, but the speed was still negotiated as SATA300 (indicating the vendor's own f/w patch is broken, or FreeBSD does not play well with it). The f/w patch didn't fix his problem either:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043432.html

[EMAIL PROTECTED] reported using his MCP55 controller without any problem -- as long as he didn't use Samsung disks. He stated that he believes Samsung disks are PATA disks that use a PATA-to-SATA adapter inside of the drive, leading to problems (and yes, those adapters are known to cause all sorts of mayhem):

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043485.html

I'm not sure what became of the thread; Daniel never provided a post-mortem. I'm left to believe he probably took [EMAIL PROTECTED]'s advice and switched to another disk vendor.
Gee, that's a whole list. Before today I didn't know that there was that much difference between disk vendors (especially in terms of compatibility). I'll keep that in mind when I buy new disks. Thing is, I've had a bunch of disks (Maxtor, Seagate, Western Digital, Samsung, etc), but I've had bad experiences with both Seagate and Western Digital. (Basically, I've never had a Seagate last me more than 2 years (laptop drives), and I had a raid5 array of WD's of which 3 crashed within 2 years.) Never had much trouble with Maxtor or Samsung yet, but obviously take this all with a grain of salt, because 10 disks don't make solid statistics.
Thanks for upgrading to 5.38. All the SMART statistics for these disks look okay.
No problem, thanks for looking into this in so much detail!
Can you run some SMART tests on the disks? You can run these tests while the disks are in use (but I/O will make the test take longer to complete):

  smartctl -t short /dev/ad4
  smartctl -t short /dev/ad6

Then you'll need to look at the SMART self test log, as well as the SMART error log, to see if anything is returned. Make sure the tests have completed (the Status field should be "Completed without error", unless an error was found of course):

  smartctl -a /dev/ad4
  smartctl -a /dev/ad6
I attached the output below; the tests passed. I thought I'd reply anyway so that you know I'm on it. Currently I'm running the offline tests, but they will take another 3 hours at least to complete. Will get you the output of those as soon as they're done.
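The "Completed without error" check described above can also be scripted. This is only a sketch: it assumes the self-test log layout that smartctl 5.38 prints (newest entry on the line starting with `# 1`), and the helper name `latest_selftest_ok` is hypothetical.

```shell
# Read a smartctl self-test log on stdin and succeed (exit status 0) only
# if the most recent entry, the line starting with "# 1", reports
# "Completed without error". Fails if that entry is missing or differs.
latest_selftest_ok() {
    awk '
        /^# *1 / {
            found = 1
            ok = ($0 ~ /Completed without error/)
            exit                      # jump to END; newest entry is enough
        }
        END { exit !(found && ok) }
    '
}
```

For example, `smartctl -l selftest /dev/ad4 | latest_selftest_ok && echo "ad4 passed"` would print the message only when the latest test finished cleanly. A test that is still running does not count as passed here.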
If nothing is found, try a different test (also safe to run during operation; don't let the word "offline" scare you), and repeat looking at the logs once more. This test may take some time, though:

  smartctl -t offline /dev/ad4
  smartctl -t offline /dev/ad6

At this point, I'm inclined to believe the issue is specific to those Samsung disks. I do not believe your PSU is a problem; the SMART statistics would be showing a higher number of power-cycles if the disks were losing power.

Worth noting (about Samsung disks) is that smartctl has options to work around 3 different firmware bugs. The bugs are SMART statistics-related, but those kinds of mistakes don't give me "warm fuzzies". Be wary. :-)
Nope, that definitely does not give great confidence. I still haven't switched the disks with respect to the controller, but since I have very little knowledge of disk debugging, I'll follow up on your suggestions first.
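The power-cycle reasoning above can be checked by pulling the raw counters out of the attribute table. A small sketch, assuming the ten-column attribute layout that smartctl 5.38 prints (RAW_VALUE in the tenth field); `smart_raw` is an illustrative name, not a smartctl feature.

```shell
# Print the RAW_VALUE of one SMART attribute, given its name as $1 and
# the attribute table (e.g. from `smartctl -A`) on stdin.
# Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED
# WHEN_FAILED RAW_VALUE, so the raw value is field 10.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}
```

Running `smartctl -A /dev/ad4 | smart_raw Power_Cycle_Count` now and again after the next round of errors shows whether the count moved. If it stays put while the errors continue, unexpected power loss becomes less likely as the cause.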
Regards, Sebastiaan
interrupt                          total       rate
irq6: fdc0                            10          0
irq14: ata0                       645057          7
irq15: ata1                           58          0
irq16: rl0                       7168276         82
irq17: rl1                        914667         10
irq18: atapci0                  30072876        347
irq20: atapci1                   1126099         12
irq21: uhci0 uhci*                   308          0
irq23: vr0                       3265771         37
cpu0: timer                    173289011       1999
Total                          216482133       2498

atapci0: <SiI SiI 3512 SATA150 controller> port 0xd200-0xd207,0xd300-0xd303,0xd400-0xd407,0xd500-0xd503,0xd600-0xd60f mem 0xf6081000-0xf60811ff irq 18 at device 10.0 on pci0
ata2: <ATA channel 0> on atapci0
ata3: <ATA channel 1> on atapci0
atapci1: <VIA 6420 SATA150 controller> port 0xd700-0xd707,0xd800-0xd803,0xd900-0xd907,0xda00-0xda03,0xdb00-0xdb0f,0xdc00-0xdcff irq 20 at device 15.0 on pci0
ata4: <ATA channel 0> on atapci1
ata5: <ATA channel 1> on atapci1
atapci2: <VIA 8237 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xdd00-0xdd0f at device 15.1 on pci0
ata0: <ATA channel 0> on atapci2
ata1: <ATA channel 1> on atapci2
ad0: 286188MB <Maxtor 6L300R0 BAH41G10> at ata0-master UDMA133
ad1: 239372MB <Maxtor 6L250R0 BAH41G10> at ata0-slave UDMA133
acd0: DVDR <LITE-ON DVDRW SHW-1635S/YS0N> at ata1-master UDMA33
ad4: 953869MB <SAMSUNG HD103UJ 1AA01112> at ata2-master SATA150
ad6: 953869MB <SAMSUNG HD103UJ 1AA01112> at ata3-master SATA150
ad8: 239372MB <Maxtor 6L250S0 BANC1G10> at ata4-master SATA150
ad10: 239372MB <Maxtor 6L250S0 BANC1G10> at ata5-master SATA150
GEOM_MIRROR: Device gm1: provider ad4 detected.
GEOM_MIRROR: Device gm1: provider ad6 detected.
GEOM_MIRROR: Device gm1: provider ad6 activated.
GEOM_MIRROR: Device gm1: rebuilding provider ad4.
GEOM_MIRROR: Device gm0: provider ad8 detected.
GEOM_MIRROR: Device gm0: provider ad10 detected.
GEOM_MIRROR: Device gm0: provider ad10 activated.
GEOM_MIRROR: Device gm0: provider ad8 activated.
Trying to mount root from ufs:/dev/ad0s1a
ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad6: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad6: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
GEOM_MIRROR: Device gm1: rebuilding provider ad4 finished.
GEOM_MIRROR: Device gm1: provider ad4 activated.
smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ1BQ606865
Firmware Version: 1AA01112
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Aug 6 11:30:09 2008 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (11811) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 198) minutes.
Conveyance self-test routine
recommended polling time:        (  21) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   090   090   011    Pre-fail  Always       -       4050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10297
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       250
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   055   054   000    Old_age   Always       -       45 (Lifetime Min/Max 40/46)
194 Temperature_Celsius     0x0022   054   052   000    Old_age   Always       -       46 (Lifetime Min/Max 36/49)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       153751007
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       250         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ1BQ607102
Firmware Version: 1AA01112
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Aug 6 11:31:28 2008 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (12131) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 203) minutes.
Conveyance self-test routine
recommended polling time:        (  22) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   090   090   011    Pre-fail  Always       -       3870
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10213
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       250
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   057   056   000    Old_age   Always       -       43 (Lifetime Min/Max 38/44)
194 Temperature_Celsius     0x0022   056   054   000    Old_age   Always       -       44 (Lifetime Min/Max 35/46)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       196672230
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   253   253   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       250         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.