>>>>> "dc" == Daniel Carosone <d...@geek.com.au> writes:
dc> There's a family of platypus in the creek just down the bike
dc> path from my house.

yeah, happy Australia Day. :)  What I didn't understand in school is
that egg-layers like echidnas are not exotic but are petting-zoo/farm/
roadkill type animals.  IMHO there's a severe taxonomic bias among
humans, like a form of OCD, that brings us ridiculous things like
``the seven-layer OSI model'' and Tetris and belief in ``RAID
edition'' drives.

dc> Typically, once a sector fails to read, this is true for that
dc> sector (or range of sectors). However, if the sector is
dc> overwritten and can be remapped, drives often continue working
dc> flawlessly for a long time thereafter.

While I don't doubt you had that experience (I've had it too), I was
mostly thinking of the google paper:

  http://labs.google.com/papers/disk_failures.html

They focus on temperature, which makes sense because it's $$$: spend
on cooling, or on replacing drives?  And they find that even >45C
does not increase failures until the third year, so basically just
forget about it, and forget also about MTBF estimates based on silly
temperature time-warp claims, and pay attention to their numbers
instead.

But the interesting result for TLER/ERC is on page 7, figure 7, where
you can see that within the first two years the effect of
reallocation on expected life is very pronounced, and they say
``after their first reallocation, drives are over 14 times more
likely to fail within 60 days than drives without reallocation
counts, making the critical threshold for this parameter also '1'.''

It also says drives which fail the 'smartctl -t long' test (again,
this part of smartctl is broken on Solaris :( please keep that in the
back of your mind :), which checks that every sector on the medium is
readable, are ``39 times more likely to fail within 60 days than
drives without scan errors.''

so... this suggests to me that read errors are not so much things
that happen from time to time even with good drives, and therefore
there is not much point in trying to write data into an unreadable
sector (to remap it), or in worrying about squeezing one marginal
sector out of an unredundant desktop drive (the drive's bad: warn the
OS, recover the data, replace it).  One of the things known to cause
bad sectors is high-flying writes, and all the google-studied drives
were in data centers, so some of this might not be true of laptop
drives that get knocked around a fair bit.

dc> Once they've run out of remapped sectors, or have started
dc> consistently producing errors, then they're cactus. Do pay
dc> attention to the smart error counts and predictors.

yes, well, you can't even read these counters on Solaris because
smartctl doesn't make it through the SATA stack, so ``do pay
attention to'' isn't very practical advice.  But if you have Linux,
the advice of the google paper is to look at the remapped sector
count (is it zero, or more than zero?), and IIRC the ``seek error
rate'' can sometimes be compared among identical model drives but is
useless otherwise.  The ``overall health assessment'' is obviously
useless, but hopefully I don't need to tell anyone that.  The
'smartctl -t long' test is my favorite, but it's proactive.

Anyway, the main result I'm interested in here is what I just said:
unreadable sectors are not a Poisson process.  They're strong
indicators of drives about to fail, ``the critical threshold for this
parameter is '1','' and not things around which you can usefully plan
cargo-cult baroque spaghetti rereading strategies.
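For the Linux case the check is short enough to spell out.  This is
just a sketch with smartmontools; /dev/sda is a placeholder for
whatever your disk actually is, and attribute names vary a bit by
vendor:

  # reallocation counters (attribute 5 and friends) -- per the google
  # paper the critical threshold is '1', i.e. any raw value above
  # zero is bad news
  smartctl -A /dev/sda | grep -i realloc

  # full-surface read test; it runs inside the drive firmware, so
  # come back after the estimated runtime and read the self-test log
  smartctl -t long /dev/sda
  smartctl -l selftest /dev/sda
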
dc> The best practices of regular scrubs and sufficiently
dc> redundant pools and separate backups stand, in spite of and
dc> indeed because of such idiocy.

ok, but the new thing that I'm arguing is that TLER/ERC is a
completely useless adaptation to a quirk of RAID card firmware and
has nothing to do with ZFS, nor with best RAID practices in general.
I'm not certain this statement is true, but from what I've heard so
far that's what I think.
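(For concreteness, here is what TLER/ERC looks like from the host
side, assuming a reasonably new smartmontools and a drive that
implements SCT at all; /dev/sda is again a placeholder, and many
desktop drives simply reject the command:

  smartctl -l scterc /dev/sda          # show current read/write ERC timeouts
  smartctl -l scterc,70,70 /dev/sda    # request 7.0s timeouts (units are 0.1s)

Whether that timer buys you anything under ZFS, as opposed to under a
fussy RAID card, is exactly what I'm doubting above.)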