On 28 Jan 2024 19:19 +0100, from h...@adminart.net (hw):
> On Fri, 2024-01-26 at 15:56 +0000, Michael Kjörling wrote:
>> On 26 Jan 2024 16:11 +0100, from h...@adminart.net (hw):
>>> I rather spend the money on new batteries (EUR 40 last time after 5
>>> years) every couple years [...]
>
> To comment myself, I think it was 3 years, not 5, sorry.
>
>>> The hardware is usually extremely difficult --- and may be impossible
>>> --- to replace.
>>
>> And let's not forget that you can _plan_ to perform the battery
>> replacement for whenever that is convenient.
>
> How do you know in advance when the battery will have failed?
You replace the battery before it fails completely. Most batteries
don't go from perfectly fine to completely dead within one charge
cycle. If the battery drains completely during a power outage before
the UPS has a chance to respond to the battery's loss of capacity,
that becomes a (hopefully clean) power cut, which _still_ is _a lot_
better than equipment which isn't designed to deal with a significant
overvoltage condition taking the brunt of a lightning strike.

I'm assuming, of course, that you replace the battery with one of the
same chemistry. The UPS will probably assume some discharge
characteristic depending on what battery type the OEM uses (lead acid,
NiCd, NiMH, Li-ion, ...); if you give the UPS a battery using some
other chemistry, that will immediately wreak havoc with lots of
things.

>> Which is quite the contrast to a lightning strike blowing out even
>> _just_ the PSU and it needing replacement before you can even use
>> the computer again (and you _hope_ that nothing more took a hit,
>> which it probably did even if the computer _seems_ to be working
>> fine).
>
> It would also hit the display(s), the switches and through that
> everything that's connected to the network, the server(s) ... That
> adds up to a lot of money.

Which is why I said "even _just_ the PSU", emphasis original.

>> It's also worth talking to your local electrician about installing
>> an incoming-mains overvoltage protection for lightning protection.
>
> Hm I thought it's expensive.

So did I, until I actually asked someone who could give me a quote
for installing it.
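(If you want to automate the "replace it before it fails" part, the
check is easy to script. A rough sketch below, assuming Network UPS
Tools; `battery.charge` and `battery.runtime` are standard NUT
variable names, but the UPS name `myups` and the thresholds are
purely illustrative, and I'm parsing canned sample output here rather
than a live `upsc` run:)

```shell
# Canned sample of `upsc myups` output; a real check would pipe
# `upsc myups` into the awk below instead.
sample_upsc='battery.charge: 97
battery.runtime: 312
ups.status: OL'

verdict=$(echo "$sample_upsc" | awk -F': ' '
  $1 == "battery.charge"  { charge = $2 }
  $1 == "battery.runtime" { runtime = $2 }
  END {
    # Illustrative thresholds: flag the battery if charge sags below
    # 80% while on mains, or estimated runtime drops under 3 minutes.
    if (charge + 0 < 80 || runtime + 0 < 180)
      print "battery likely needs replacement"
    else
      print "battery OK"
  }')
echo "$verdict"
```

Run that from cron once a day and mail yourself the verdict, and you
get a decent early warning well before the battery is actually dead.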
> That doesn't exactly help when the failed disk has disappeared
> altogether, as if it had been removed ;)

If that happens, I'd get output along the lines of:

# zpool status
  pool: tank
 state: DEGRADED
  scan: scrub repaired <n>B in <amount of time> with <n> errors on <date and time>
config:

        NAME                              STATE     READ WRITE CKSUM
        tank                              DEGRADED     0     0     0
          raidz2-0                        DEGRADED     0     0     0
            wwn-0x0000000000000001-crypt  ONLINE       0     0     0
            8446744073709551616           UNAVAIL      0     0     0  was /dev/mapper/wwn-0x1111111111111113-crypt
            wwn-0x2222222222222225-crypt  ONLINE       0     0     0
            wwn-0x3333333333333337-crypt  ONLINE       0     0     0
            wwn-0x4444444444444449-crypt  ONLINE       0     0     0
            wwn-0x555555555555555b-crypt  ONLINE       0     0     0

clearly identifying the problem. And also most likely a lot of event
notifications telling me that wwn-0x1111111111111113-crypt is having
issues within the "tank" pool, plus any applicable kernel logs for
the device disconnection and perhaps lower-level I/O errors.

Similarly, if a storage device suddenly starts returning garbage,
that will likely show up as CKSUM errors, and the device will
eventually get kicked out of the pool, showing as state FAULTED with
large error counter values.

(zpool status would also provide some more explanatory details, in
the example above including that "applications are unaffected"
because sufficient redundancy would still exist; but I'm eliding
those here because I don't have them handy and don't feel like
creating such a situation just to get example output. The important
part is that the disk that dropped off the bus will likely show as
UNAVAIL with its internal identifier and a reference to its WWN
because of my naming scheme, instead of as completely missing. The
solution is to get a replacement disk, plug it in, execute "sudo
zpool replace tank $numeric_id $new_device_path", and wait a while,
all the while still being able to use the system normally.)
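(For what it's worth, spotting the unhealthy rows in that output is
also easy to script. A rough sketch below that prints every vdev
whose STATE column is not ONLINE; I'm feeding it a canned excerpt of
the table, with the same hypothetical device IDs as above, rather
than running `zpool status` live:)

```shell
# Canned excerpt of the NAME/STATE table from `zpool status`;
# a real check would be: zpool status tank | awk ...
sample_status='NAME                              STATE     READ WRITE CKSUM
tank                              DEGRADED     0     0     0
  raidz2-0                        DEGRADED     0     0     0
    wwn-0x0000000000000001-crypt  ONLINE       0     0     0
    8446744073709551616           UNAVAIL      0     0     0  was /dev/mapper/wwn-0x1111111111111113-crypt
    wwn-0x2222222222222225-crypt  ONLINE       0     0     0'

# Print every row whose second (STATE) column is neither the header
# nor ONLINE; empty lines have no second field and are skipped too.
unhealthy=$(echo "$sample_status" |
  awk '$2 != "" && $2 != "STATE" && $2 != "ONLINE" {print $1, $2}')
echo "$unhealthy"
```

That prints the pool, the degraded raidz2 vdev, and the UNAVAIL disk
with its numeric ID, which is exactly what you need for the
"zpool replace" step.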
No matter what kind of storage solution you're using - hardware RAID,
software RAID, no redundancy, whichever - or how you're doing backups
(assuming that you are, for some value of "you"), you can't just
ignore issues with it. That way lies data loss.

-- 
Michael Kjörling 🔗 https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”