On Aug 19, 2011, at 11:24 PM, Jeremy Chadwick wrote:

> On Fri, Aug 19, 2011 at 09:39:17PM -0400, Dan Langille wrote:
>>
>> On Aug 19, 2011, at 7:21 PM, Jeremy Chadwick wrote:
>>
>>> On Fri, Aug 19, 2011 at 04:50:01PM -0400, Dan Langille wrote:
>>>> System in question: FreeBSD 8.2-STABLE #3: Thu Mar 3 04:52:04 GMT 2011
>>>>
>>>> After a recent power failure, I'm seeing this in my logs:
>>>>
>>>> Aug 19 20:36:34 bast smartd[1575]: Device: /dev/ad2, 2 Currently
>>>> unreadable (pending) sectors
>>>
>>> I doubt this is related to a power failure.
>>>
>>>> Searching on that error message, I was led to believe that identifying the
>>>> bad sector and
>>>> running dd to read it would cause the HDD to reallocate that bad block.
>>>>
>>>> http://smartmontools.sourceforge.net/badblockhowto.html
>>>
>>> This is incorrect (meaning you've misunderstood what's written there).
>>>
>>> Unreadable LBAs can be a result of the LBA being actually bad (as in
>>> uncorrectable), or the LBA being marked "suspect". In either case the
>>> LBA will return an I/O error when read.
>>>
>>> If the LBAs are marked "suspect", the drive will perform re-analysis of
>>> the LBA (to determine if the LBA can be read and the data re-mapped, or
>>> if it cannot then the LBA is marked uncorrectable) when you **write** to
>>> the LBA.
>>>
>>> The above smartd output doesn't tell me much. Providing actual SMART
>>> attribute data (smartctl -a) for the drive would help. The brand of the
>>> drive, the firmware version, and the model all matter -- every drive
>>> behaves a little differently.
>>
>> Information such as this?
>> http://beta.freebsddiary.org/smart-fixing-bad-sector.php
>
> Yes, perfect. Thank you. First thing first: upgrade smartmontools to
> 5.41. Your attributes will be the same after you do this (the drive is
> already in smartmontools' internal drive DB), but I often have to remind
> people that they really need to keep smartmontools updated as often as
> possible. The changes between versions are vast; this is especially
> important for people with SSDs (I'm responsible for submitting some
> recent improvements for Intel 320 and 510 SSDs).

Done.

> Anyway, the drive (albeit an old PATA Maxtor) appears to have three
> anomalies:
>
> 1) One confirmed reallocated LBA (SMART attribute 5)
>
> 2) One "suspect" LBA (SMART attribute 197)
>
> 3) A very high temperature of 51C (SMART attribute 194). If this drive
> is in an enclosure or in a system with no fans this would be
> understandable, otherwise this is a bit high. My home workstation which
> has only one case fan has a drive with more platters than your Maxtor,
> and it idles at ~38C. Possibly this drive has been undergoing constant
> I/O recently (which does greatly increase drive temperature)? Not sure.
> I'm not going to focus too much on this one.

This is an older system. I suspect insufficient ventilation. I'll look at
getting a new case fan, if not some HDD fans.

> The SMART error log also indicates an LBA failure at the 26000 hour mark
> (which is 16 hours prior to when you did smartctl -a /dev/ad2). Whether
> that LBA is the remapped one or the suspect one is unknown. The LBA was
> 5566440.
>
> The SMART tests you did didn't really amount to anything; no surprise.
> short and long tests usually do not test the surface of the disk. There
> are some drives which do it on a long test, but as I said before,
> everything varies from drive to drive.
>
> Furthermore, on this model of drive, you cannot do surface scans via
> SMART. Bummer. That's indicated in the "Offline data collection
> capabilities" section at the top, where it reads:
>
> No Selective Self-test supported.
>
> So you'll have to use the dd method. This takes longer than if surface
> scanning was supported by the drive, but is acceptable. I'll get to how
> to go about that in a moment.

FWIW, I've done a dd read of the entire suspect disk already. Just two errors.
From the URL mentioned above:

[root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror
dd: /dev/ad2: Input/output error
2717+0 records in
2717+0 records out
2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec)
dd: /dev/ad2: Input/output error
38170+1 records in
38170+1 records out
40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec)
[root@bast:~] #

That seems to indicate two problems. Are those the values I should be using
with dd?

I did some more precise testing:

# time dd of=/dev/null if=/dev/ad2 bs=512 iseek=5566440
dd: /dev/ad2: Input/output error
9+0 records in
9+0 records out
4608 bytes transferred in 5.368668 secs (858 bytes/sec)

real    0m5.429s
user    0m0.000s
sys     0m0.010s

NOTE: dd read 9 blocks before failing, so the bad block is 9 blocks later
than the LBA mentioned in smartctl.

The above generated this in /var/log/messages:

Aug 20 17:29:25 bast kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=5566449

> [stuff snipped]

> That said:
>
> http://jdc.parodius.com/freebsd/bad_block_scan
>
> If you run this on your ad2 drive, I'm hoping what you'll find are two
> LBAs which can't be read -- one will be the remapped LBA and one will be
> the "suspect" LBA. If you only get one LBA error then that's fine too,
> and will be the "suspect" LBA.

> Once you have the LBA(s), you can submit writes to them to get the drive
> to re-analyse them (assuming they're "suspect"):
>
> dd if=/dev/zero of=/dev/XXX bs=512 count=1 seek=NNNNN
>
> Where XXX is the device and NNNNN is the LBA number.
>
> If this works properly, the dd command should sit there for a little bit
> (as the drive does its re-analysis magic) and then should complete.

ad2 is part of a gmirror with ad0. Does this change things?

I haven't tried the dd yet.

>
> You'll want to check SMART stats after that; you should see
> Current_Pending_Sector drop to 0. If Offline_Uncorrectable incremented
> then the LBA could not be re-read/remapped.

It did increment:

197 Current_Pending_Sector 0x0032 100 100 020 Old_age Always - 2 [was 1]

> If Reallocated_Sector_Ct
> incremented then you now have a total of 2 LBAs which are remapped.

It did increment:

$ diff smarctl.1 smarctl.3 | grep Reallocated_Sector_Ct
<   5 Reallocated_Sector_Ct 0x0033 100 100 020 Pre-fail Always - 1
>   5 Reallocated_Sector_Ct 0x0033 100 100 020 Pre-fail Always - 2

Full output of smartctl has been appended to
http://beta.freebsddiary.org/smart-fixing-bad-sector.php

> In
> the case of remapping, you get to deal with the UFS/FFS thing above.
> To get the stats to update in this situation you *might* (but probably
> not) have to run "smartctl -t offline /dev/XXX".

I didn't try that...

>
> You might also be wondering "that dd command writes 512 bytes of zero to
> that LBA; what about the old data that was there, in the case that the
> drive remaps the LBA?" This is a great question, and one I've never
> actually taken the time to answer because at this present time I have
> absolutely *no* bad disks in my possession. I'm under the impression
> that the write does in fact write zeros if the LBA is remapped, but that
> might not be true at all. I've been waiting to test this for quite some
> time and document it/write about it.
>
> I still suggest you replace the drive, although given its age I doubt
> you'll be able to find a suitable replacement. I tend to keep disks
> like this around for testing/experimental purposes and not for actual
> use.

I have several unused 80GB HDDs I can place into this system. I think that's
what I'll wind up doing. But I'd like to follow this process through and get
it documented for future reference.

-- 
Dan Langille - http://langille.org

_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
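
A note on the numbers from the full-disk read: dd counts whole records, so with bs=1m the "2717+0 records in" line only narrows the first failure to a 1 MB window, not to a single sector. The sketch below does that arithmetic in plain sh. It assumes 512-byte sectors (the size used with bs=512 in this thread), and record_to_lba_range is a hypothetical helper name, not part of dd or smartmontools.

```shell
#!/bin/sh
# Assumption: 512-byte sectors, matching the bs=512 reads in this thread.
SECTOR=512

# Hypothetical helper: print the first and last LBA covered by dd record
# number $1 (counting from 0) when dd runs with a block size of $2 bytes.
record_to_lba_range() {
    record=$1
    bs=$2
    per_record=$((bs / SECTOR))        # sectors in one dd record
    first=$((record * per_record))     # first LBA of that record
    last=$((first + per_record - 1))   # last LBA of that record
    echo "$first $last"
}

# dd completed 2717 full 1 MB records before the first error, so the
# failing read was record number 2717 (bs=1m is 1048576 bytes).
record_to_lba_range 2717 1048576
# prints: 5564416 5566463
```

That window, LBAs 5564416 through 5566463, contains both 5566440 from the SMART error log and 5566449 from the kernel message, so the full-disk read and the targeted bs=512 read agree. The record count itself is not an LBA; with bs=512, the per-sector LBA is what goes into seek=NNNNN.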