Re: SMART Uncorrectable_Error_Cnt rising - should I be worried?

Michael Kjörling Tue, 09 Jan 2024 08:21:53 -0800

On 9 Jan 2024 08:11 -0500, from [email protected] (The Wanderer):
> Within the past few weeks, I got root-mail notifications from smartd
> that the ATA error count on two of the drives had increased - one from 0
> to a fairly low value (I think between 10 and 20), the other from 0 to
> 1. I figured this was nothing to worry about - because of the relatively
> low values, because the other drives had not shown any such thing, and
> because of the expected stability and lifetime of good-quality SSDs.
> 
> 
> On Sunday (two days ago), I got root-mail notifications from smartd
> about *all* of the drives in the array. This time, the total error
> counts had gone up to values in the multiple hundreds per drive. Since
> then (yesterday), I've also gotten further notification mails about at
> least one of the drives increasing further. So far today I have not
> gotten any such notifications.


A single bad block, or a few, is nothing to be overly concerned about.
I had an Intel SSD which lived a long, healthy, happy life with one
bad sector and never gave any signs of further problems.

Hundreds of bad blocks per drive is certainly cause for concern.

More worrying is a _significant increase in the rate of increase_ of
the bad blocks count. That suggests that the drive is suffering from
some underlying problem.
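
To make "rate of increase" concrete, here is a minimal Python sketch.
It assumes you note down the error count each time smartd mails you;
the readings below are made-up illustrative numbers, not values from
your drives, and the function names are my own.

```python
def increments(counts):
    """Differences between consecutive readings of an error counter."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


def is_accelerating(counts):
    """True if the latest increment is larger than the one before it."""
    steps = increments(counts)
    return len(steps) >= 2 and steps[-1] > steps[-2]


# Hypothetical readings, oldest first (illustration only).
readings = [0, 1, 15, 300]
print(increments(readings))       # [1, 14, 285]
print(is_accelerating(readings))  # True
```

A counter that grows by a steady amount per reading would return
False here; it is the growing step size that I would treat as the
warning sign.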


> So... as the Subject asks, should I be worried? How do I interpret these
> results, and at what point do they start to reflect something to take
> action over? If there is not reason to be worried, what *do* these
> alerts indicate, and at what point *should* I start to be worried about
> them?

At an absolute minimum, were it me, I would refresh my backups. As an
8-wide RAID-6 of 2 TB drives nets you about 12 TB of storage, I'd say
get yourself a ~16 TB external rotational HDD and set up to back up
onto it. You should have backups anyway; there's no time like the
present to get started.
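
The 12 TB figure is just the usual RAID-6 arithmetic: two drives'
worth of space goes to parity. A small Python sketch (the function
name is my own, and it ignores filesystem and metadata overhead):

```python
def raid6_usable_tb(n_drives, drive_tb):
    """Usable capacity of a RAID-6 array, before filesystem overhead."""
    if n_drives < 4:
        raise ValueError("RAID-6 needs at least 4 drives")
    # Two drives' worth of capacity is consumed by parity.
    return (n_drives - 2) * drive_tb


print(raid6_usable_tb(8, 2.0))  # 12.0
```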

You are admittedly in a much better position than many; if the errors
are randomly located, odds are that you have sufficient redundancy to
manage within the storage array.

The good news is in SMART attributes 5 and 179: taken in combination,
I take them as an indication that all (31) reallocated sectors have
been reallocated into the spare sectors pool, and that this
represents approximately 2% of the spare sectors pool.

Absolutely do keep an eye on attribute 179. If the spare sectors pool
starts to fill up, the drive won't be able to reallocate any further
sectors, and your RAID array won't do you much good.
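
If you want to track those two attributes over time rather than
eyeball them, something like the following Python sketch could pull
them out of saved `smartctl -A` output. It assumes the usual
ten-column attribute table. The sample text is invented for
illustration: only the raw value 31 comes from this thread, and the
attribute names and normalized values are placeholders that will
differ by vendor.

```python
def parse_smart_attributes(text):
    """Parse the attribute table from `smartctl -A` output into a dict by ID."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),       # normalized value
            "thresh": int(fields[5]),      # failure threshold
            "raw": " ".join(fields[9:]),   # raw value, may contain spaces
        }
    return attrs


# Invented sample for illustration; not output from the drives in question.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       31
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   098   098   010    Pre-fail  Always       -       31
"""

attrs = parse_smart_attributes(SAMPLE)
for attr_id in (5, 179):
    a = attrs[attr_id]
    print(attr_id, a["name"], a["value"], a["raw"])
```

Logging the normalized value of 179 alongside its threshold once a
day would show how quickly the pool is being used up.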

I would also keep an eye out for I/O errors in the kernel log, but be
mindful of which devices they are coming from.
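
As a rough way to sort those by device, here is a Python sketch that
counts kernel log lines per device. It assumes the block layer's
"I/O error, dev <name>, sector ..." wording, which may vary between
kernel versions; the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed message shape: "... I/O error, dev sdc, sector 123456 ..."
IO_ERROR = re.compile(r"I/O error, dev ([^,\s]+),")


def io_errors_by_device(lines):
    """Count I/O error lines per block device name."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines (illustration only).
sample_log = [
    "blk_update_request: I/O error, dev sdc, sector 123456 op 0x0:(READ)",
    "blk_update_request: I/O error, dev sdc, sector 123464 op 0x0:(READ)",
    "I/O error, dev sdf, sector 2048 op 0x1:(WRITE)",
    "usb 1-1: new high-speed USB device number 4 using xhci_hcd",
]
print(io_errors_by_device(sample_log))
```

In practice you would feed it the output of `dmesg` or
`journalctl -k` and then map the device names back to array members.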

-- 
Michael Kjörling                     🔗 https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”
