On Mon, 25 Aug 2014, Adam Thompson wrote:

>On 14-08-25 03:49 PM, Dave Anderson wrote:
>> My amd64 notebook (full dmesg below) has started reporting an error
>> which I don't adequately understand.  Any explanations or ideas as to
>> how to figure out exactly what is broken would be greatly appreciated.
>
>Your hard disk is in the process of (hopefully slowly!) breaking.
>
>> This started while untarring the ports tree from the source CD
>> immediately after upgrading from 5.4-release to 5.5-release (from CD).
>> I initially guessed that it was related to some change in 5.5, but
>> testing while booted from install CDs for 5.4-release, 5.6-20140822 and
>> a 4.7-release I had handy all give the same result.
>
>Normal.  It won't matter what software you're running because it's a
>hardware issue.
>
>> The error appears to be tied to a particular spot on the disk (it seems
>> to occur when, e.g., I try to 'ls' a particular directory)
>
>Yes.  It'll be some particular sector that the disk controller is having
>difficulty reading.  No matter what version of the OS you boot, those
>directory entries still reside on the same sector on disk.
>
>> but it looks
>> to me like it could be a controller error or perhaps a controller quirk
>> which OpenBSD doesn't handle well.  The only information about it I can
>> find is these two messages in /var/log/messages:
>>
>> Aug 18 14:08:08 minya /bsd: ahci0: attempting to idle device
>> Aug 18 14:08:08 minya /bsd: ahci0: couldn't recover NCQ error, failing all outstanding commands.
>
>Nope.  The "quirk" is that your HDD is taking too long to read that
>sector (normally because of too many retries), the AHCI stack times out,
>and the only sane thing to do with timing out a request is to pretend
>all the other pending commands have also failed - otherwise you could
>get undefined results (i.e. even worse errors).
>
>Presumably the HDD eventually manages to read the sector, and succeeds
>by the time the VFS or block-cache or whatever I/O layer resubmits the
>request for that data.  Otherwise you'd see other error messages
>following the two you mention.
>
>> I've hunted through all the other log files I can think of without
>> finding anything that looks related.  Other than this, the system
>> appears to be running normally (though I haven't been doing much with it
>> other than poking around trying to understand this problem).
>
>Nope - this is the only symptom you're likely to see, unless you happen
>to be running some sort of SMART monitor and you happen to be monitoring
>"correctable read errors" in that tool.
>
>From the hard disk's standpoint, all is well - you asked for a sector,
>and it (eventually) gave it to you.  The only problem is that your
>software is too impatient, from a certain point of view.

That all makes sense.  Thanks.

It would be nice if that error message mentioned the timeout -- I think
that would have convinced me that it was definitely the disk that was
dying rather than it possibly being something else.
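
To check my understanding of the explanation above, here is a toy model
of it in Python.  It is only a sketch and is not taken from the OpenBSD
ahci(4) driver: the names Command, service_queue and demo, the timeout
value, and the sector numbers are all made up for illustration.  It
shows one slow sector causing every outstanding command to be failed,
and the resubmitted requests then succeeding once the disk can read
that sector quickly.

```python
# Toy model only: NOT the OpenBSD ahci(4) driver. All names are hypothetical.
from dataclasses import dataclass

TIMEOUT_MS = 5000  # assumed timeout, chosen only for illustration


@dataclass
class Command:
    tag: int            # queue slot of the command
    sector: int         # sector the command wants to read
    status: str = "pending"


def service_queue(outstanding, read_latency_ms, timeout_ms=TIMEOUT_MS):
    """Complete all commands, or fail all of them if any one times out."""
    timed_out = [c for c in outstanding
                 if read_latency_ms(c.sector) > timeout_ms]
    if timed_out:
        # One command timed out, so none of the queued results are trusted.
        for c in outstanding:
            c.status = "failed"
        return False
    for c in outstanding:
        c.status = "done"
    return True


def demo():
    slow_attempts = {1234: 1}  # sector 1234 is slow on its first read only

    def latency(sector):
        if slow_attempts.get(sector, 0) > 0:
            slow_attempts[sector] -= 1
            return 30000   # many internal retries: far beyond the timeout
        return 10

    queue = [Command(0, 100), Command(1, 1234), Command(2, 200)]
    first_ok = service_queue(queue, latency)
    first_statuses = [c.status for c in queue]

    # The upper I/O layer resubmits the same requests as fresh commands.
    retry = [Command(c.tag, c.sector) for c in queue]
    second_ok = service_queue(retry, latency)
    return first_ok, first_statuses, second_ok, [c.status for c in retry]
```

Calling demo() fails all three commands on the first pass, even though
only sector 1234 was slow, and completes all three on the resubmission.
That matches what I saw: two log messages and nothing after them.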

>From a real-world point of view, however, you probably should make sure
>everything on that disk is backed up.  Then you should either do a
>low-level format (almost impossible nowadays[1]) and still not trust it
>for important data, or just replace it.
>
>-Adam
>
>[1] While low-level formatting is not really possible nowadays unless
>you work in the manufacturer's lab, a few "ATA Secure Erase" passes
>might resuscitate the disk for a while if you really, really, REALLY
>don't want to replace it right now for some reason.  Most people boot a
>Linux CD to do this, but atactl(8) appears to support the "secerase"
>command.  There are all sorts of things that could prevent you from
>doing this, and if you can't work past them, you probably should just
>throw the drive away.

Yup, time for a new disk.  I'm off to do some research on who makes the
most reliable ones these days.  [Suggestions from anyone knowledgeable
are welcome.]

        Dave

-- 
Dave Anderson
<d...@daveanderson.com>
