On 9 Jan 2024 08:11 -0500, from wande...@fastmail.fm (The Wanderer):
> Within the past few weeks, I got root-mail notifications from smartd
> that the ATA error count on two of the drives had increased - one from 0
> to a fairly low value (I think between 10 and 20), the other from 0 to
> 1. I figured this was nothing to worry about - because of the relatively
> low values, because the other drives had not shown any such thing, and
> because of the expected stability and lifetime of good-quality SSDs.
>
> On Sunday (two days ago), I got root-mail notifications from smartd
> about *all* of the drives in the array. This time, the total error
> counts had gone up to values in the multiple hundreds per drive. Since
> then (yesterday), I've also gotten further notification mails about at
> least one of the drives increasing further. So far today I have not
> gotten any such notifications.

A single bad block, or a few, is nothing to be overly concerned about. I
had an Intel SSD which lived a long, healthy, happy life with one bad
sector and never gave any signs of further problems.

Hundreds of bad blocks per drive is certainly cause for concern. More
worrying is a _significant increase in the rate of increase_ of the bad
block count. That suggests that the drive is suffering from some
underlying problem.

> So... as the Subject asks, should I be worried? How do I interpret these
> results, and at what point do they start to reflect something to take
> action over? If there is not reason to be worried, what *do* these
> alerts indicate, and at what point *should* I start to be worried about
> them?

At an absolute minimum, were it me, I would refresh my backups. An
8-wide RAID-6 of 2 TB drives nets you about 12 TB of storage (two
drives' worth of capacity goes to parity), so I'd say get yourself a
~16 TB external rotational HDD and set up to back up onto it. You should
have backups anyway; there's no time like the present to get started.

You are admittedly in a much better position than many; if the errors
are randomly located, odds are that you have sufficient redundancy to
manage within the storage array.

The good news is in SMART attributes 5 and 179; taken in combination, I
take them as indication that all (31) reallocated sectors have been
reallocated into the spare sectors pool, and this represents
approximately 2% of the spare sectors pool.

Absolutely do keep an eye on attribute 179. If the spare sectors pool
starts to fill up, the drive won't be able to reallocate any further
sectors, and your RAID array won't do you much good.

I would also keep an eye out for I/O errors in the kernel log, but be
mindful of which devices they are coming from.

-- 
Michael Kjörling 🔗 https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”
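
PS: For keeping an eye on attributes 5 and 179, here is a minimal Python
sketch, not a tested monitoring tool. It parses the attribute table that
`smartctl -A` prints and pulls out the normalized and raw values. The
function names are made up for illustration, and the sample table is
invented (only the raw count of 31 comes from this thread). Reading
"100 minus the normalized value of attribute 179" as the percentage of
the spare pool used is an assumption; it is vendor-specific, so check it
against what your drives actually report.

```python
def parse_smart_attributes(text):
    """Parse the attribute table from `smartctl -A` output.

    Returns {attribute_id: (name, normalized_value, raw_value)}.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            normalized = int(fields[3])
            raw = int(fields[9])
        except ValueError:
            continue  # skip rows whose values are not plain integers
        attrs[int(fields[0])] = (fields[1], normalized, raw)
    return attrs


def spare_pool_used_percent(attrs, attr_id=179):
    """ASSUMPTION: the normalized value counts down from 100 as spares are used."""
    if attr_id not in attrs:
        return None
    return 100 - attrs[attr_id][1]


# Invented sample for illustration only; not output from the drives discussed here.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       31
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   098   098   010    Pre-fail  Always       -       31
"""

attrs = parse_smart_attributes(SAMPLE)
print(attrs[5][2], attrs[179][2], spare_pool_used_percent(attrs))
```

To feed it real data, something like
`subprocess.run(["smartctl", "-A", "/dev/sdX"], capture_output=True, text=True).stdout`
should work (usually as root), with /dev/sdX standing in for each array
member. Logging the raw values once a day would also let you see whether
the rate of increase is itself going up.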