Re: Help, please, understanding AHCI error on amd64

Adam Thompson Mon, 25 Aug 2014 14:31:09 -0700

On 14-08-25 03:49 PM, Dave Anderson wrote:

My amd64 notebook (full dmesg below) has started reporting an error
which I don't adequately understand.  Any explanations or ideas as to
how to figure out exactly what is broken would be greatly appreciated.


Your hard disk is in the process of (hopefully slowly!) breaking.

This started while untarring the ports tree from the source CD
immediately after upgrading from 5.4-release to 5.5-release (from CD).
I initially guessed that it was related to some change in 5.5, but
testing while booted from install CDs for 5.4-release, 5.6-20140822 and
a 4.7-release I had handy all give the same result.

Normal. It won't matter what software you're running because it's a hardware issue.

The error appears to be tied to a particular spot on the disk (it seems
to occur when, e.g., I try to 'ls' a particular directory)

Yes. It'll be some particular sector that the disk controller is having difficulty reading. No matter what version of the OS you boot, those directory entries still reside on the same sector on disk.

but it looks
to me like it could be a controller error or perhaps a controller quirk
which OpenBSD doesn't handle well.  The only information about it I can
find is these two messages in /var/log/messages:

Aug 18 14:08:08 minya /bsd: ahci0: attempting to idle device
Aug 18 14:08:08 minya /bsd: ahci0: couldn't recover NCQ error, failing all outstanding commands.

Nope. The "quirk" is that your HDD is taking too long to read that sector (normally because of too many retries), the AHCI stack times out, and the only sane thing to do when a request times out is to pretend all the other pending commands have also failed - otherwise you could get undefined results (i.e. even worse errors).

Presumably the HDD eventually manages to read the sector, and succeeds the next time the VFS or block-cache or whatever I/O layer resubmits the request for that data. Otherwise you'd see other error messages following the two you mention.
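
As a rough illustration of that "fail everything, then let the upper layer resubmit" policy, here is a toy model in Python. It is only a sketch: it is not the ahci(4) driver's logic, and every name and number in it (Command, service_queue, read_with_resubmit, the timeout and latency values) is made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Command:
    tag: int                 # queue slot for the command (hypothetical)
    sector: int              # sector the command wants to read
    status: str = "pending"  # "pending", "done" or "failed"


def service_queue(commands, slow_sectors, timeout_ms=1000, fast_ms=10, slow_ms=5000):
    """Complete commands in order; on one timeout, fail all that are still outstanding."""
    for i, cmd in enumerate(commands):
        # Assumed latencies: a marginal sector takes far longer than the timeout.
        latency = slow_ms if cmd.sector in slow_sectors else fast_ms
        if latency > timeout_ms:
            # The results of the remaining commands can't be trusted, so fail them too.
            for victim in commands[i:]:
                victim.status = "failed"
            return False
        cmd.status = "done"
    return True


def read_with_resubmit(sectors, slow_sectors, max_attempts=3):
    """Model the upper I/O layer resubmitting the whole batch after a failure."""
    slow = set(slow_sectors)
    for attempt in range(1, max_attempts + 1):
        queue = [Command(tag, sector) for tag, sector in enumerate(sectors)]
        if service_queue(queue, slow):
            return attempt
        # Assumption: the drive's internal retries have succeeded by the next try.
        slow.clear()
    return None
```

In this model, read_with_resubmit([100, 200, 300], {200}) returns 2: the first pass times out on sector 200 and fails the commands still in the queue, and the resubmitted batch completes.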

I've hunted through all the other log files I can think of without
finding anything that looks related.  Other than this, the system
appears to be running normally (though I haven't been doing much with it
other than poking around trying to understand this problem).

Nope - this is the only symptom you're likely to see, unless you happen to be running some sort of SMART monitor and you happen to be monitoring "correctable read errors" in that tool.
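
If you do want to look at the SMART counters, something like the following Python sketch could help. It assumes smartmontools is installed and that `smartctl -A` prints its usual ten-column attribute table; the helper names and the list of watched attributes are my own choices, and which attributes a drive reports (and what their raw values mean) varies by vendor.

```python
import subprocess

# Attribute names commonly associated with read trouble. This set is an
# assumption for the example; not every drive reports all of them.
WATCHED = (
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_attributes(text):
    """Return {attribute_name: raw_value_string} from `smartctl -A` table output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = " ".join(fields[9:])
    return attrs


def watched_nonzero(attrs):
    """Pick out watched attributes whose raw value starts with a nonzero integer."""
    hits = {}
    for name in WATCHED:
        if name not in attrs:
            continue
        first = attrs[name].split()[0]
        # Some vendors pack several counters into the raw value, so a nonzero
        # number here is a hint to look closer, not proof of a failing disk.
        if first.isdigit() and int(first) > 0:
            hits[name] = int(first)
    return hits


def check_device(device):
    """Run smartctl on `device`; the device path is whatever your system uses."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return watched_nonzero(parse_attributes(result.stdout))
```

Calling check_device() needs sufficient privileges to query the disk, and returns an empty dict when none of the watched counters is above zero.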

From the hard disk's standpoint, all is well - you asked for a sector, and it (eventually) gave it to you. The only problem is that your software is too impatient, from a certain point of view.

From a real-world point of view, however, you probably should make sure everything on that disk is backed up. Then you should either do a low-level format (almost impossible nowadays[1]) and still not trust it for important data, or just replace it.


-Adam

[1] While low-level formatting is not really possible nowadays unless you work in the manufacturer's lab, a few "ATA Secure Erase" passes might resuscitate the disk for a while if you really, really, REALLY don't want to replace it right now for some reason. Most people boot a Linux CD to do this, but atactl(8) appears to support the "secerase" command. There are all sorts of things that could prevent you from doing this, and if you can't work past them, you probably should just throw the drive away.


--
-Adam Thompson
 athom...@athompso.net
