On Mon, 25 Aug 2014, Adam Thompson wrote:

>On 14-08-25 03:49 PM, Dave Anderson wrote:
>> My amd64 notebook (full dmesg below) has started reporting an error
>> which I don't adequately understand. Any explanations or ideas as to
>> how to figure out exactly what is broken would be greatly appreciated.
>
>Your hard disk is in the process of (hopefully slowly!) breaking.
>
>> This started while untarring the ports tree from the source CD
>> immediately after upgrading from 5.4-release to 5.5-release (from CD).
>> I initially guessed that it was related to some change in 5.5, but
>> testing while booted from install CDs for 5.4-release, 5.6-20140822 and
>> a 4.7-release I had handy all give the same result.
>
>Normal. It won't matter what software you're running because it's a
>hardware issue.
>
>> The error appears to be tied to a particular spot on the disk (it seems
>> to occur when, e.g., I try to 'ls' a particular directory)
>
>Yes. It'll be some particular sector that the disk controller is having
>difficulty reading. No matter what version of the OS you boot, those
>directory entries still reside on the same sector on disk.
>
>> but it looks
>> to me like it could be a controller error or perhaps a controller quirk
>> which OpenBSD doesn't handle well. The only information about it I can
>> find is these two messages in /var/log/messages:
>>
>> Aug 18 14:08:08 minya /bsd: ahci0: attempting to idle device
>> Aug 18 14:08:08 minya /bsd: ahci0: couldn't recover NCQ error, failing all
>> outstanding commands.
>
>Nope. The "quirk" is that your HDD is taking too long to read that
>sector (normally because of too many retries), the AHCI stack times out,
>and the only sane thing to do with timing out a request is to pretend
>all the other pending commands have also failed - otherwise you could
>get undefined results (i.e. even worse errors).
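
For illustration only, here is a toy model of the behaviour described above. It is not the OpenBSD ahci(4) code: the class names, the `run` method, and the timeout and latency numbers are all made up. It only shows the idea that one slow read causes every queued command to be reported as failed, and that a later resubmission can still succeed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Command:
    tag: int               # queue slot of the request (hypothetical)
    sector: int            # sector the request wants to read
    status: str = "pending"


@dataclass
class ToyPort:
    timeout_ms: int        # hypothetical driver timeout
    outstanding: List[Command] = field(default_factory=list)

    def submit(self, cmd: Command) -> None:
        self.outstanding.append(cmd)

    def run(self, read_time_ms: Dict[int, int]) -> List[Command]:
        """Finish the current batch; read_time_ms maps sector -> drive latency."""
        # Sectors not listed are assumed to be read almost instantly (1 ms).
        timed_out = any(
            read_time_ms.get(c.sector, 1) > self.timeout_ms
            for c in self.outstanding
        )
        # One timeout means every queued command is treated as failed.
        for c in self.outstanding:
            c.status = "failed" if timed_out else "ok"
        done, self.outstanding = self.outstanding, []
        return done


if __name__ == "__main__":
    port = ToyPort(timeout_ms=5000)
    for tag, sector in enumerate([100, 200, 300]):
        port.submit(Command(tag, sector))
    # Sector 200 needs many retries, so the whole batch is failed.
    print([c.status for c in port.run({200: 9000})])

    # The upper layer resubmits; this time the drive answers in time.
    for tag, sector in enumerate([100, 200, 300]):
        port.submit(Command(tag, sector))
    print([c.status for c in port.run({200: 800})])
```
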
>
>Presumably the HDD eventually manages to read the sector, and succeeds
>the time the VFS or block-cache or whatever I/O layer resubmits the
>request for that data. Otherwise you'd see other error messages
>following the two you mention.
>
>> I've hunted through all the other log files I can think of without
>> finding anything that looks related. Other than this, the system
>> appears to be running normally (though I haven't been doing much with it
>> other than poking around trying to understand this problem).
>
>Nope - this is the only symptom you're likely to see, unless you happen
>to be running some sort of SMART monitor and you happen to be monitoring
>"correctable read errors" in that tool.
>
>From the hard disk's standpoint, all is well - you asked for a sector,
>and it (eventually) gave it to you. The only problem is that your
>software is too impatient, from a certain point of view.

That all makes sense. Thanks.

It would be nice if that error message mentioned the timeout -- I think
that would have convinced me that it was definitely the disk that was
dying rather than it possibly being something else.

>From a real-world point of view, however, you probably should make sure
>everything on that disk is backed up. Then you should either do a
>low-level format (almost impossible nowadays[1]) and still not trust it
>for important data, or just replace it.
>
>-Adam
>
>[1] While low-level formatting is not really possible nowadays unless
>you work in the manufacturer's lab, a few "ATA Secure Erase" passes
>might resuscitate the disk for a while if you really, really, REALLY
>don't want to replace it right now for some reason. Most people boot a
>Linux CD to do this, but atactl(8) appears to support the "secerase"
>command. There are all sorts of things that could prevent you from
>doing this, and if you can't work past them, you probably should just
>throw the drive away.

Yup, time for a new disk. I'm off to do some research on who makes the
most reliable ones these days. [Suggestions from anyone knowledgable are
welcome.]

	Dave

--
Dave Anderson
<d...@daveanderson.com>