Re: detecting bad disks

Chris Cappuccio Thu, 08 Nov 2007 10:02:15 -0800

All drives develop read errors over time.  When you write to these blocks,
the drive may automatically remap them and the errors disappear.  Just because
you get some read errors doesn't mean the drive is necessarily about to die.
But if you develop new bad blocks with any frequency, you might want to replace
the drive.
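
To make "new bad blocks with any frequency" something you can actually check,
here is a rough sketch, not anything atactl ships with.  It assumes the
readattr output looks like the tables quoted below, and the names
parse_readattr and check are made up for illustration.  It reports any
attribute whose normalized value has fallen to its threshold, plus nonzero raw
counts for the sector-related attributes (5, 196, 197, 198).

```python
import re

# IDs whose raw value counts reallocated, pending, or uncorrectable sectors.
SECTOR_ATTRS = (5, 196, 197, 198)

# One table row: ID, name, threshold, normalized value, raw value in hex.
ROW = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s+(\d+)\s+0x([0-9a-fA-F]+)\s*$")


def parse_readattr(text):
    """Return {id: (name, threshold, value, raw)} from readattr output."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attr_id, name, threshold, value, raw = m.groups()
            attrs[int(attr_id)] = (name, int(threshold), int(value), int(raw, 16))
    return attrs


def check(attrs, previous=None):
    """List reasons for concern; `previous` is an earlier parse, if any."""
    problems = []
    for attr_id, (name, threshold, value, raw) in sorted(attrs.items()):
        # SMART trips when the normalized value falls to the threshold.
        if threshold > 0 and value <= threshold:
            problems.append(f"{attr_id} {name}: value {value} <= threshold {threshold}")
        if attr_id in SECTOR_ATTRS and raw > 0:
            note = f"{attr_id} {name}: raw count {raw}"
            if previous and attr_id in previous and raw > previous[attr_id][3]:
                note += f" (up from {previous[attr_id][3]})"
            problems.append(note)
    return problems


# Rows copied from the wd1 output quoted below.
WD1 = """\
ID      Attribute name                  Threshold       Value   Raw
  5     Reallocated Sector Count         140            200     0x000000000000
197     Current Pending Sector Count       0            197     0x000000000068
198     Off-line Scan Uncorrectable Sect   0            199     0x000000000032
"""

if __name__ == "__main__":
    for problem in check(parse_readattr(WD1)):
        print(problem)
```

Run as is, it prints the two nonzero counts on wd1 (104 pending sectors and 50
off-line uncorrectable).  Saving each parse and passing the last one in as
previous shows whether the counts are growing, which tells you more than any
single reading.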


Derick Siddoway [EMAIL PROTECTED] wrote:
> Trying to copy a file from one filesystem to another, I kept getting
> input/output errors.  I noticed these messages in the logs:
> 
> wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
> wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
> wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
> wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
> wd1a:  uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417; cn 762 tn 5 sn 6), retrying
> wd1a:  uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417; cn 762 tn 5 sn 6)
> 
> Okay, so clearly wd1 has some issues (wd1a is the only filesystem on that
> disk).  I've already started moving the data to a different disk.
> 
> Now, I thought I was going to be alerted to this sort of thing automatically
> because of an entry like this one in the crontab:
> 
> 0 * * * *       /sbin/atactl /dev/wd0c smartstatus >/dev/null
> 
> However, when I run this by hand, I get
> 
> [EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd1 smartstatus
> No SMART threshold exceeded
> 
> So clearly, the SMART stuff wasn't going to tell me about this.
> 
> However:
> [EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd1 readattr
> Attributes table revision: 16
> ID      Attribute name                  Threshold       Value   Raw
>   1     Raw Read Error Rate               51            199     0x000000000081
>   3     Spin Up Time                      21            123     0x000000001127
>   4     Start/Stop Count                  40             99     0x00000000056f
>   5     Reallocated Sector Count         140            200     0x000000000000
>   7     Seek Error Rate                   51            200     0x000000000000
>   9     Power-on Hours Count               0             73     0x000000004da4
>  10     Spin Retry Count                  51            100     0x000000000000
>  11     Unknown                           51            100     0x000000000000
>  12     Device Power Cycle Count           0             99     0x00000000056e
> 194     Temperature                        0            101     0x000000000031
> 196     Reallocation Event Count           0            200     0x000000000000
> 197     Current Pending Sector Count       0            197     0x000000000068
> 198     Off-line Scan Uncorrectable Sect   0            199     0x000000000032
> 199     Ultra DMA CRC Error Count          0            200     0x000000000000
> 
> I see a number of values that exceed the preset thresholds.
> But I see the same kinds of values on the other three drives:
> 
> [EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd0 readattr
> Attributes table revision: 16
> ID      Attribute name                  Threshold       Value   Raw
>   1     Raw Read Error Rate               51            200     0x000000000000
>   3     Spin Up Time                      21             96     0x00000000175f
>   4     Start/Stop Count                  40             96     0x00000000110f
>   5     Reallocated Sector Count         140            196     0x00000000003a
>   7     Seek Error Rate                   51            200     0x000000000000
>   9     Power-on Hours Count               0             80     0x000000003a71
>  10     Spin Retry Count                  51            100     0x000000000000
>  11     Unknown                           51            100     0x000000000000
>  12     Device Power Cycle Count           0             99     0x000000000585
> 196     Reallocation Event Count           0            181     0x000000000013
> 197     Current Pending Sector Count       0            200     0x000000000000
> 198     Off-line Scan Uncorrectable Sect   0            200     0x000000000000
> 199     Ultra DMA CRC Error Count          0            200     0x000000000001
> 200     Unknown                           51            200     0x000000000000
> [EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd2 readattr
> Attributes table revision: 16
> ID      Attribute name                  Threshold       Value   Raw
>   3     Spin Up Time                      63            200     0x000000001b3c
>   4     Start/Stop Count                   0            253     0x000000000020
>   5     Reallocated Sector Count          63            253     0x000000000000
>   6     Unknown                          100            253     0x000000000000
>   7     Seek Error Rate                    0            253     0x000000000000
>   8     Seek Time Performance            187            253     0x00000000aa64
>   9     Power-on Hours Count               0            217     0x00000000b2b8
>  10     Spin Retry Count                 157            253     0x000000000000
>  11     Unknown                          223            253     0x000000000000
>  12     Device Power Cycle Count           0            253     0x00000000003b
> 192     Power-off Retract Count            0            253     0x000000000000
> 193     Load Cycle Count                   0            253     0x000000000000
> 194     Temperature                        0            253     0x00000000001f
> 195     Unknown                            0            253     0x000000009dba
> 196     Reallocation Event Count           0            253     0x000000000000
> 197     Current Pending Sector Count       0            253     0x000000000000
> 198     Off-line Scan Uncorrectable Sect   0            253     0x000000000000
> 199     Ultra DMA CRC Error Count          0            199     0x000000000000
> 200     Unknown                            0            253     0x000000000000
> 201     Unknown                            0            253     0x00000000014e
> 202     Unknown                            0            253     0x000000000000
> 203     Unknown                          180            253     0x000000000008
> 204     Unknown                            0            253     0x000000000000
> 205     Unknown                            0            253     0x000000000000
> 207     Unknown                            0            253     0x000000000000
> 208     Unknown                            0            253     0x000000000000
> 209     Unknown                            0            253     0x000000000000
>  99     Unknown                            0            253     0x000000000000
> 100     Unknown                            0            253     0x000000000000
> 101     Unknown                            0            253     0x000000000000
> [EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd3 readattr
> Attributes table revision: 16
> ID      Attribute name                  Threshold       Value   Raw
>   3     Spin Up Time                      63            204     0x00000000330f
>   4     Start/Stop Count                   0            253     0x000000000041
>   5     Reallocated Sector Count          63            253     0x000000000000
>   6     Unknown                          100            253     0x000000000000
>   7     Seek Error Rate                    0            253     0x000000000000
>   8     Seek Time Performance            187            253     0x00000000c738
>   9     Power-on Hours Count               0            211     0x000000006ace
>  10     Spin Retry Count                 157            253     0x000000000000
>  11     Unknown                          223            253     0x000000000000
>  12     Device Power Cycle Count           0            253     0x000000000063
> 192     Power-off Retract Count            0            253     0x000000000000
> 193     Load Cycle Count                   0            253     0x000000000000
> 194     Temperature                        0            253     0x000000000024
> 195     Unknown                            0            253     0x000000000ca3
> 196     Reallocation Event Count           0            253     0x000000000000
> 197     Current Pending Sector Count       0            253     0x000000000000
> 198     Off-line Scan Uncorrectable Sect   0            253     0x000000000000
> 199     Ultra DMA CRC Error Count          0            199     0x000000000000
> 200     Unknown                            0            253     0x000000000000
> 201     Unknown                            0            253     0x000000000000
> 202     Unknown                            0            253     0x000000000000
> 203     Unknown                          180            253     0x000000000000
> 204     Unknown                            0            253     0x000000000000
> 205     Unknown                            0            253     0x000000000000
> 207     Unknown                            0            253     0x000000000000
> 208     Unknown                            0            253     0x000000000000
> 209     Unknown                            0            193     0x000000000000
>  99     Unknown                            0            253     0x000000000000
> 100     Unknown                            0            253     0x000000000000
> 101     Unknown                            0            253     0x000000000000
> [EMAIL PROTECTED]:$ 
> 
> I'm not sure what to believe in all of this.  The only thing I can clearly
> state is that wd1 appears to be going bad, but I can't tell a good way to
> be alerted of this fact prior to actually getting input/output errors in
> the filesystem.  What's the best way to do this short of monitoring?
> 
> 
> -- 
> Derick Siddoway      And so, the children of the revolution were faced with
> [EMAIL PROTECTED]  the age-old problem: it wasn't that you had the wrong kind
>                      of government, which was obvious, but that you had the
>                      wrong kind of people.  ( Terry Pratchett, "Night Watch" )

-- 
Those who can, do.
Those who can't, sue.
