Hi,

Thanks again for the detailed reply!

See the very bottom of my mail.  I don't believe the PSU is the problem,
after reviewing your SMART statistics.

Ok, I'll stick to the one I have then, for now.

My other (on-board) SATA controller is a VIA controller, and I've never had any problems with it (although the hardware RAID messed up once a year or two ago, and since then I've been using software RAID without any issues).

Okay, so you've got an onboard VIA (VT6410) SATA controller, an onboard
VIA IDE controller, and a PCI SATA controller.  I'd still like to know
which disks are attached to what controller, and if any of the devices
are sharing IRQs.  Can you provide the output from the following two
commands?

dmesg | egrep 'atapci|(ad|ata)[0-9]+'
vmstat -i

I'm just trying to narrow stuff down.
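
If it helps, the output of those two commands can be boiled down with a couple of small filters. This is only a rough sketch: the function names shared_irqs and disk_controllers are made up for illustration, and the parsing assumes the FreeBSD 6.x message layout seen in this thread (other releases may print things differently).

```shell
# Both functions read command output on stdin and assume the FreeBSD 6.x
# message layout quoted in this thread; other releases may differ.

# Print `vmstat -i` lines that name more than one device on a single IRQ.
# A plain line has 4 fields: "irqN:", one device, total, rate.
shared_irqs() {
    awk '/^irq[0-9]+:/ && NF > 4 { print }'
}

# Print "disk channel controller" for every adN disk found in dmesg output.
disk_controllers() {
    awk '
        # "ata2: <ATA channel 0> on atapci0" -> channel ata2 lives on atapci0
        /^ata[0-9]+: .* on atapci[0-9]+$/ {
            chan = $1; sub(/:$/, "", chan)
            parent[chan] = $NF
        }
        # "ad4: ... at ata2-master SATA150" -> disk ad4 hangs off channel ata2
        /^ad[0-9]+: .* at ata[0-9]+-/ {
            disk = $1; sub(/:$/, "", disk)
            for (i = 2; i < NF; i++)
                if ($i == "at") { chan = $(i + 1); sub(/-.*/, "", chan) }
            ctl = (chan in parent) ? parent[chan] : "unknown"
            print disk, chan, ctl
        }
    '
}
```

Usage would be "dmesg | disk_controllers" and "vmstat -i | shared_irqs". An IRQ line listing several device names is a hint that the interrupt is shared; a controller with a line of its own is presumably not sharing.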

All right, attached is the output of both commands.

It's interesting that the disks which are giving you trouble are Samsung
disks.  There's some history here which you should be made aware of:

In July, Daniel Eriksson reported data corruption occurring with his
nVidia MCP55 chipset when 1TB Samsung disks were attached to it.  The
same disks on another controller performed fine.  The corruption was
being detected by ZFS as checksum errors.  (UFS/UFS2 won't detect this
sort of thing, unless the corruption is occurring somewhere within the
filesystem tables.)

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043427.html

Soren Schmidt (ata(4) author) replied that there are some nVidia
chipset-related fixes for ATA in -CURRENT, and provided a patch.  Daniel
reported that the patch made absolutely no difference:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043434.html

Daniel also tried using a firmware patch for his Samsung disks, which
limits the SATA speed to SATA150, but the speed was still negotiated as
SATA300 (indicating the vendor's own f/w patch is broken, or FreeBSD
does not play well with it).  The f/w patch didn't fix his problem
either:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043432.html

[EMAIL PROTECTED] reported using his MCP55 controller without any
problem -- as long as he didn't use Samsung disks.  He stated that he
believes Samsung disks are PATA disks that use a PATA-to-SATA adapter
inside of the drive, leading to problems (and yes, those adapters are
known to cause all sorts of mayhem):

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043485.html

I'm not sure what became of the thread; Daniel never provided a
post-mortem.  I'm left to believe he probably took [EMAIL PROTECTED]'s
advice and switched to another disk vendor.

Gee, that's quite a list. Before today I didn't know there was that much difference between disk vendors (especially in terms of compatibility). I'll keep that in mind when I buy new disks. Thing is, I've had a bunch of disks (Maxtor, Seagate, Western Digital, Samsung, etc.), but I've had bad experiences with both Seagate and Western Digital. (Basically, I've never had a Seagate last me more than 2 years (laptop drives), and I had a RAID5 array of WDs of which 3 crashed within 2 years.) Never had much trouble with Maxtor or Samsung yet, but obviously take this all with a grain of salt, because 10 disks don't make solid statistics.

Thanks for upgrading to 5.38.  All the SMART statistics for these disks
look okay.

No problem, thanks for looking into this in so much detail!

Can you run some SMART tests on the disks?  You can run these tests
while the disks are in use (but I/O will make the test take longer to
complete):

smartctl -t short /dev/ad4
smartctl -t short /dev/ad6

Then you'll need to look at the SMART self test log, as well as the
SMART error log, to see if anything is returned.  Make sure the tests
have completed (the Status field should be "Completed without error",
unless an error was found of course):

smartctl -a /dev/ad4
smartctl -a /dev/ad6
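
Checking the self-test log by eye works fine, but the check can also be scripted. A minimal sketch, assuming the self-test log entries look like the "# 1  Short offline  Completed without error ..." lines that smartctl 5.38 prints; failed_selftests is a hypothetical helper name, not part of smartmontools.

```shell
# Print self-test log entries that do not say "Completed without error".
# Reads `smartctl -a` output on stdin.
# Exit status: 0 if no such entry was found, 1 otherwise.
# Note: a test that is still running is also reported, and an empty log
# yields exit status 0, so make sure a test was actually started first.
failed_selftests() {
    awk '
        /^# *[0-9]+ / {
            if ($0 !~ /Completed without error/) { print; bad = 1 }
        }
        END { exit bad + 0 }
    '
}
```

For example, "smartctl -a /dev/ad4 | failed_selftests" should print nothing and return 0 once the short test has finished cleanly.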

I attached the output below; the tests passed. I thought I'd reply so that you know I'm on it. Currently I'm running the offline tests, but they will take at least another 3 hours to complete. I'll get you the output of those as soon as they're done.

If nothing is found, try a different test (also safe to run during
operation; don't let the word "offline" scare you), and repeat looking
at the logs once more.  This test may take some time, though:

smartctl -t offline /dev/ad4
smartctl -t offline /dev/ad6

At this point, I'm inclined to believe the issue is specific to those
Samsung disks.  I do not believe your PSU is a problem; the SMART
statistics would be showing a higher number of power-cycles if the disks
were losing power.
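
To make that power-cycle comparison easy to repeat, the relevant counters can be pulled out of the attribute table. A rough sketch only: power_stats is a made-up name, and it assumes each attribute is printed on a single line with the raw value in the 10th column, which is how smartctl 5.38 lays out the table.

```shell
# Print the raw values of the SMART attributes that relate to power loss.
# Reads `smartctl -A` (or `smartctl -a`) output on stdin.
# Assumes one attribute per line, with RAW_VALUE as the 10th field.
power_stats() {
    awk '
        $2 == "Start_Stop_Count" ||
        $2 == "Power_On_Hours"   ||
        $2 == "Power_Cycle_Count" { print $2 "=" $10 }
    '
}
```

Run as "smartctl -A /dev/ad4 | power_stats". Both HD103UJ disks in this thread report a Power_Cycle_Count of 4 after 250 power-on hours; a disk that keeps dropping power would be expected to show the cycle count climbing between checks.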

Worth noting (about Samsung disks) is that smartctl has options to work
around 3 different firmware bugs.  The bugs are SMART statistics-related,
but those kinds of mistakes don't give me "warm fuzzies".  Be wary.  :-)

Nope, that definitely does not give great confidence.

I still haven't switched the disks with respect to the controller, but since I have very little knowledge of disk debugging, I'll follow up on your suggestions first.

Regards,
Sebastiaan

interrupt                          total       rate
irq6: fdc0                            10          0
irq14: ata0                       645057          7
irq15: ata1                           58          0
irq16: rl0                       7168276         82
irq17: rl1                        914667         10
irq18: atapci0                  30072876        347
irq20: atapci1                   1126099         12
irq21: uhci0 uhci*                   308          0
irq23: vr0                       3265771         37
cpu0: timer                    173289011       1999
Total                          216482133       2498
atapci0: <SiI SiI 3512 SATA150 controller> port 0xd200-0xd207,0xd300-0xd303,0xd400-0xd407,0xd500-0xd503,0xd600-0xd60f mem 0xf6081000-0xf60811ff irq 18 at device 10.0 on pci0
ata2: <ATA channel 0> on atapci0
ata3: <ATA channel 1> on atapci0
atapci1: <VIA 6420 SATA150 controller> port 0xd700-0xd707,0xd800-0xd803,0xd900-0xd907,0xda00-0xda03,0xdb00-0xdb0f,0xdc00-0xdcff irq 20 at device 15.0 on pci0
ata4: <ATA channel 0> on atapci1
ata5: <ATA channel 1> on atapci1
atapci2: <VIA 8237 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xdd00-0xdd0f at device 15.1 on pci0
ata0: <ATA channel 0> on atapci2
ata1: <ATA channel 1> on atapci2
ad0: 286188MB <Maxtor 6L300R0 BAH41G10> at ata0-master UDMA133
ad1: 239372MB <Maxtor 6L250R0 BAH41G10> at ata0-slave UDMA133
acd0: DVDR <LITE-ON DVDRW SHW-1635S/YS0N> at ata1-master UDMA33
ad4: 953869MB <SAMSUNG HD103UJ 1AA01112> at ata2-master SATA150
ad6: 953869MB <SAMSUNG HD103UJ 1AA01112> at ata3-master SATA150
ad8: 239372MB <Maxtor 6L250S0 BANC1G10> at ata4-master SATA150
ad10: 239372MB <Maxtor 6L250S0 BANC1G10> at ata5-master SATA150
GEOM_MIRROR: Device gm1: provider ad4 detected.
GEOM_MIRROR: Device gm1: provider ad6 detected.
GEOM_MIRROR: Device gm1: provider ad6 activated.
GEOM_MIRROR: Device gm1: rebuilding provider ad4.
GEOM_MIRROR: Device gm0: provider ad8 detected.
GEOM_MIRROR: Device gm0: provider ad10 detected.
GEOM_MIRROR: Device gm0: provider ad10 activated.
GEOM_MIRROR: Device gm0: provider ad8 activated.
Trying to mount root from ufs:/dev/ad0s1a
ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad6: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad6: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
GEOM_MIRROR: Device gm1: rebuilding provider ad4 finished.
GEOM_MIRROR: Device gm1: provider ad4 activated.
smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ1BQ606865
Firmware Version: 1AA01112
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Aug  6 11:30:09 2008 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (11811) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 198) minutes.
Conveyance self-test routine
recommended polling time:        (  21) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   090   090   011    Pre-fail  Always       -       4050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10297
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       250
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   055   054   000    Old_age   Always       -       45 (Lifetime Min/Max 40/46)
194 Temperature_Celsius     0x0022   054   052   000    Old_age   Always       -       46 (Lifetime Min/Max 36/49)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       153751007
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       250         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ1BQ607102
Firmware Version: 1AA01112
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Aug  6 11:31:28 2008 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (12131) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 203) minutes.
Conveyance self-test routine
recommended polling time:        (  22) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   090   090   011    Pre-fail  Always       -       3870
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10213
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       250
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   057   056   000    Old_age   Always       -       43 (Lifetime Min/Max 38/44)
194 Temperature_Celsius     0x0022   056   054   000    Old_age   Always       -       44 (Lifetime Min/Max 35/46)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       196672230
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   253   253   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       250         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
