On 14-08-25 03:49 PM, Dave Anderson wrote:
My amd64 notebook (full dmesg below) has started reporting an error
which I don't adequately understand.  Any explanations or ideas as to
how to figure out exactly what is broken would be greatly appreciated.

Your hard disk is in the process of (hopefully slowly!) breaking.

This started while untarring the ports tree from the source CD
immediately after upgrading from 5.4-release to 5.5-release (from CD).
I initially guessed that it was related to some change in 5.5, but
testing while booted from install CDs for 5.4-release, 5.6-20140822 and
a 4.7-release I had handy all give the same result.

Normal. It won't matter what software you're running because it's a hardware issue.

The error appears to be tied to a particular spot on the disk (it seems
to occur when, e.g., I try to 'ls' a particular directory)

Yes. It'll be some particular sector that the disk controller is having difficulty reading. No matter what version of the OS you boot, those directory entries still reside on the same sector on disk.

but it looks
to me like it could be a controller error or perhaps a controller quirk
which OpenBSD doesn't handle well.  The only information about it I can
find is these two messages in /var/log/messages:

Aug 18 14:08:08 minya /bsd: ahci0: attempting to idle device
Aug 18 14:08:08 minya /bsd: ahci0: couldn't recover NCQ error, failing all outstanding commands.

Nope. The "quirk" is that your HDD is taking too long to read that sector (usually because it needs too many retries), so the AHCI stack times out. The only sane thing to do when a request times out is to treat all the other pending commands as failed too; otherwise you could get undefined results (i.e. even worse errors).

Presumably the HDD eventually manages to read the sector, and the read succeeds when the VFS, block cache, or whatever I/O layer resubmits the request for that data. Otherwise you'd see other error messages following the two you mention.
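
To make the sequence described above a bit more concrete, here is a toy Python model. It is not the OpenBSD ahci(4) code: every name, timing, and retry count is invented for illustration, and it assumes the drive keeps whatever progress it made on the bad sector between requests. It only shows the shape of the behaviour: one slow sector times out, every queued command is failed along with it, and a later resubmission goes through.

```python
# Toy model only: all names and numbers here are hypothetical.
TIMEOUT_MS = 100   # assumed driver timeout per command
ATTEMPT_MS = 10    # assumed cost of one internal read attempt


class ToyDisk:
    """A disk with one marginal sector that needs many internal retries."""

    def __init__(self, bad_sector, attempts_needed):
        self.bad_sector = bad_sector
        self.attempts_left = attempts_needed

    def read(self, sector, timeout_ms):
        """Return True if the sector is read before the timeout expires."""
        if sector != self.bad_sector:
            return True  # healthy sector: assumed to complete in time
        affordable = timeout_ms // ATTEMPT_MS
        done = min(affordable, self.attempts_left)
        self.attempts_left -= done  # assumption: the drive keeps its progress
        return self.attempts_left == 0


def submit_batch(disk, sectors):
    """Issue queued reads; a single timeout fails every outstanding command."""
    results = {s: disk.read(s, TIMEOUT_MS) for s in sectors}
    if not all(results.values()):
        return {s: False for s in sectors}
    return results


def read_with_resubmit(disk, sectors, max_tries=3):
    """Upper layer: resubmit the whole batch until it succeeds.

    Returns the number of the try that succeeded, or None if all tries failed.
    """
    for attempt in range(1, max_tries + 1):
        if all(submit_batch(disk, sectors).values()):
            return attempt
    return None
```

With `ToyDisk(bad_sector=7, attempts_needed=15)`, the first batch containing sector 7 fails as a whole and the second succeeds, which matches seeing the two log messages and nothing after them.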

I've hunted through all the other log files I can think of without
finding anything that looks related.  Other than this, the system
appears to be running normally (though I haven't been doing much with it
other than poking around trying to understand this problem).

Nope - this is the only symptom you're likely to see, unless you happen to be running some sort of SMART monitor and you happen to be monitoring "correctable read errors" in that tool.
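
Since the log is the only place this shows up, one way to watch whether it is getting worse is to count those messages per day. The following is a minimal Python sketch; the function name and the regular expression are mine, and it assumes the log lines look like the two quoted above.

```python
import re
from collections import Counter

# Matches syslog lines shaped like the one quoted above, e.g.
# "Aug 18 14:08:08 minya /bsd: ahci0: couldn't recover NCQ error, ..."
NCQ_ERROR = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\S+\s+\S+\s+/bsd:\s+"
    r"(?P<dev>ahci\d+): couldn't recover NCQ error"
)


def count_ncq_errors(lines):
    """Count unrecovered NCQ errors per (month, day, controller)."""
    counts = Counter()
    for line in lines:
        m = NCQ_ERROR.match(line)
        if m:
            key = (m.group("month"), int(m.group("day")), m.group("dev"))
            counts[key] += 1
    return counts
```

You would feed it the lines of /var/log/messages, e.g. `count_ncq_errors(open("/var/log/messages"))`. A count that climbs from day to day would suggest the disk is deteriorating rather than holding steady.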

From the hard disk's standpoint, all is well - you asked for a sector, and it (eventually) gave it to you. The only problem is that your software is too impatient, from a certain point of view.

From a real-world point of view, however, you probably should make sure everything on that disk is backed up. Then you should either do a low-level format (almost impossible nowadays[1]) and still not trust it for important data, or just replace it.

-Adam

[1] While low-level formatting is not really possible nowadays unless you work in the manufacturer's lab, a few "ATA Secure Erase" passes might resuscitate the disk for a while if you really, really, REALLY don't want to replace it right now for some reason. Most people boot a Linux CD to do this, but atactl(8) appears to support the "secerase" command. There are all sorts of things that could prevent you from doing this, and if you can't work past them, you probably should just throw the drive away.

--
-Adam Thompson
 athom...@athompso.net
