On 14-08-25 03:49 PM, Dave Anderson wrote:
> My amd64 notebook (full dmesg below) has started reporting an error
> which I don't adequately understand. Any explanations or ideas as to
> how to figure out exactly what is broken would be greatly appreciated.
Your hard disk is in the process of (hopefully slowly!) breaking.
> This started while untarring the ports tree from the source CD
> immediately after upgrading from 5.4-release to 5.5-release (from CD).
> I initially guessed that it was related to some change in 5.5, but
> testing while booted from install CDs for 5.4-release, 5.6-20140822 and
> a 4.7-release I had handy all give the same result.
Normal. It won't matter what software you're running because it's a
hardware issue.
> The error appears to be tied to a particular spot on the disk (it seems
> to occur when, e.g., I try to 'ls' a particular directory)
Yes. It'll be some particular sector that the disk controller is having
difficulty reading. No matter what version of the OS you boot, those
directory entries still reside on the same sector on disk.
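If you want to see roughly where that spot is, a sequential read that times each block will usually show it. Here is a rough Python sketch, not a tested diagnostic tool. The function name, block size, and thresholds are my own assumptions, and so is the device name in the usage note. Reading a failing disk end to end stresses it further, so take your backup first.

```python
import os
import time


def scan_for_bad_spots(path, block_size=64 * 1024, slow_seconds=1.0,
                       max_problems=100):
    """Read `path` front to back; return (offset, problem) for bad spots."""
    problems = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        # Stop early if the whole device keeps failing (e.g. it vanished).
        while len(problems) < max_problems:
            start = time.monotonic()
            try:
                data = os.pread(fd, block_size, offset)
            except OSError as exc:
                # The kernel gave up on this block: record it and move on.
                problems.append((offset, "error: %s" % exc))
                offset += block_size
                continue
            elapsed = time.monotonic() - start
            if not data:
                break  # end of file or device
            if elapsed > slow_seconds:
                # Readable, but only after a long wait (likely retries).
                problems.append((offset, "slow: %.2fs" % elapsed))
            offset += len(data)
    finally:
        os.close(fd)
    return problems
```

Run as root against the raw whole-disk device, e.g. `scan_for_bad_spots("/dev/rsd0c")` (assuming the disk attached as sd0). Offsets that show up on every run, whatever OS you booted, point at the failing sectors.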
> but it looks
> to me like it could be a controller error or perhaps a controller quirk
> which OpenBSD doesn't handle well. The only information about it I can
> find is these two messages in /var/log/messages:
> Aug 18 14:08:08 minya /bsd: ahci0: attempting to idle device
> Aug 18 14:08:08 minya /bsd: ahci0: couldn't recover NCQ error, failing all outstanding commands.
Nope. The "quirk" is that your HDD is taking too long to read that
sector (usually because of too many retries), the AHCI stack times out,
and the only sane thing to do when a request times out is to treat
all the other pending commands as failed too - otherwise you could
get undefined results (i.e. even worse errors).
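To make that policy concrete, here is a toy Python model of it. This is purely illustrative and is not how the OpenBSD ahci(4) driver is actually written; the `Command` class, its fields, and the function name are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Command:
    tag: int                 # NCQ tag identifying the queued command
    lba: int                 # first sector the command wants to read
    status: str = "pending"


def fail_all_on_timeout(outstanding, timed_out_tag):
    """One command timed out: fail every command still in the queue."""
    if all(cmd.tag != timed_out_tag for cmd in outstanding):
        raise ValueError("no outstanding command with tag %d" % timed_out_tag)
    failed = list(outstanding)
    for cmd in failed:
        # After the timeout nobody knows which of these really completed,
        # so none of them is reported as successful.
        cmd.status = "failed"
    outstanding.clear()
    return failed


queue = [Command(0, 1000), Command(1, 52000), Command(2, 1008)]
failed = fail_all_on_timeout(queue, timed_out_tag=1)
print([(c.tag, c.status) for c in failed])
# prints [(0, 'failed'), (1, 'failed'), (2, 'failed')]
```

Only tag 1 hit the slow sector, but tags 0 and 2 are handed back as failed as well, and the layers above are expected to resubmit them.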
Presumably the HDD eventually manages to read the sector, and the read
succeeds by the time the VFS, block cache, or whatever I/O layer it is
resubmits the request for that data. Otherwise you'd see other error
messages following the two you mention.
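If you'd rather not eyeball the log, a few lines of Python can check whether anything error-like follows each NCQ failure. This is a sketch under assumptions: the function name and the five-line window are arbitrary choices of mine, and matching on the word "error" is a crude heuristic, not a list of what the kernel actually prints.

```python
import re

NCQ_ERROR = re.compile(r"ahci\d+: couldn't recover NCQ error")


def ncq_errors_with_followups(lines, window=5):
    """Pair each unrecovered-NCQ-error line with later error-like lines."""
    lines = [ln.rstrip("\n") for ln in lines]
    events = []
    for i, line in enumerate(lines):
        if not NCQ_ERROR.search(line):
            continue
        # Look at the next few lines for anything else that looks bad.
        followups = [
            later for later in lines[i + 1:i + 1 + window]
            if "error" in later.lower() and not NCQ_ERROR.search(later)
        ]
        events.append((line, followups))
    return events


sample = [
    "Aug 18 14:08:08 minya /bsd: ahci0: attempting to idle device",
    "Aug 18 14:08:08 minya /bsd: ahci0: couldn't recover NCQ error, "
    "failing all outstanding commands.",
]
events = ncq_errors_with_followups(sample)
print(len(events), events[0][1])
# prints 1 []
```

On the real file you would pass `open("/var/log/messages")` instead of `sample`. An empty follow-up list for every event matches the "retry eventually worked" reading; a growing number of events over time means the disk is getting worse.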
> I've hunted through all the other log files I can think of without
> finding anything that looks related. Other than this, the system
> appears to be running normally (though I haven't been doing much with it
> other than poking around trying to understand this problem).
Nope - this is the only symptom you're likely to see, unless you happen
to be running some sort of SMART monitor and you happen to be monitoring
"correctable read errors" in that tool.
From the hard disk's standpoint, all is well - you asked for a sector,
and it (eventually) gave it to you. The only problem is that your
software is too impatient, from a certain point of view.
From a real-world point of view, however, you probably should make sure
everything on that disk is backed up. Then you should either do a
low-level format (almost impossible nowadays[1]) and still not trust it
for important data, or just replace it.
-Adam
[1] While low-level formatting is not really possible nowadays unless
you work in the manufacturer's lab, a few "ATA Secure Erase" passes
might resuscitate the disk for a while if you really, really, REALLY
don't want to replace it right now for some reason. Most people boot a
Linux CD to do this, but atactl(8) appears to support the "secerase"
command. There are all sorts of things that could prevent you from
doing this, and if you can't work past them, you probably should just
throw the drive away.
--
-Adam Thompson
athom...@athompso.net