On 28 Jan 2024 19:19 +0100, from h...@adminart.net (hw):
> On Fri, 2024-01-26 at 15:56 +0000, Michael Kjörling wrote:
>> On 26 Jan 2024 16:11 +0100, from h...@adminart.net (hw):
>>> I rather spend the money on new batteries (EUR 40 last time after 5
>>> years) every couple years [...]
> 
> To correct myself, I think it was 3 years, not 5, sorry.
> 
>>> The hardware is usually extremely difficult --- and may be impossible
>>> --- to replace.
>> 
>> And let's not forget that you can _plan_ to perform the battery
>> replacement for whenever that is convenient.
> 
> How do you know in advance when the battery will have failed?

You replace the battery before it fails completely.

Most batteries don't go from perfectly fine to completely dead within
one charge cycle.

If the battery drains completely during a power outage before the UPS
has a chance to respond to the battery's loss of capacity, that
becomes a (hopefully clean) power cut, which _still_ is _a lot_ better
than equipment which isn't designed to deal with a significant
overvoltage condition taking the brunt of a lightning strike.

I'm assuming, of course, that you replace the battery with one of the
same chemistry. The UPS will probably assume some discharge
characteristic depending on what battery type the OEM uses (lead acid,
NiCd, NiMH, LiIon, ...); of course if you give the UPS a battery using
some other chemistry, that'll immediately wreak havoc with lots of
things.


>> Which is quite the contrast to a lightning strike blowing out even
>> _just_ the PSU and it needing replacement before you can even use
>> the computer again (and you _hope_ that nothing more took a hit,
>> which it probably did even if the computer _seems_ to be working
>> fine).
> 
> It would also hit the display(s), the switches and through that
> everything that's connected to the network, the server(s) ...  That
> adds up to a lot of money.

Which is why I said "even _just_ the PSU", emphasis original.


>> It's also worth talking to your local electrician about installing an
>> incoming-mains overvoltage protection for lightning protection.
> 
> Hm, I thought it was expensive.

So did I until I actually asked someone who could give me a quote for
actually installing it.


> That doesn't exactly help when the failed disk has disappeared
> altogether, as if it had been removed ;)

If that happens, I'd get output along the lines of:

# zpool status
  pool: tank
 state: DEGRADED
  scan: scrub repaired <n>B in <amount of time> with <n> errors on <date and time>
config:

        NAME                              STATE     READ WRITE CKSUM
        tank                              DEGRADED     0     0     0
          raidz2-0                        DEGRADED     0     0     0
            wwn-0x0000000000000001-crypt  ONLINE       0     0     0
            8446744073709551616           UNAVAIL      0     0     0  was /dev/mapper/wwn-0x1111111111111113-crypt
            wwn-0x2222222222222225-crypt  ONLINE       0     0     0
            wwn-0x3333333333333337-crypt  ONLINE       0     0     0
            wwn-0x4444444444444449-crypt  ONLINE       0     0     0
            wwn-0x555555555555555b-crypt  ONLINE       0     0     0

clearly identifying the problem. And also most likely a lot of event
notifications telling me that wwn-0x1111111111111113-crypt is having
issues within the "tank" pool, plus any applicable kernel logs for the
device disconnection and perhaps lower-level I/O errors. Similarly, if
a storage device suddenly starts returning garbage, that will likely
show up as CKSUM errors and the device will eventually get kicked out
of the pool, showing as state FAULTED with large error counter values.
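(For what it's worth, those event notifications come from the ZFS
Event Daemon, zed, and where they go is configurable. A minimal
sketch of its configuration, assuming an OpenZFS install with the
stock zed.rc; the email address is of course a placeholder:

```shell
# /etc/zfs/zed.d/zed.rc -- excerpt; zed must be restarted to pick
# up changes.

# Where zed sends notification mail; empty/unset disables email.
ZED_EMAIL_ADDR="root@example.invalid"

# Rate-limit: minimum seconds between notifications for the same
# event class on the same pool/vdev.
ZED_NOTIFY_INTERVAL_SECS=3600

# Also notify on events that aren't outright failures (e.g.
# resilver/scrub finish), not just degraded/faulted devices.
ZED_NOTIFY_VERBOSE=1
```

Which events trigger what is determined by the scripts in
/etc/zfs/zed.d/; the variables above just feed into those.)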

(zpool status would also provide some more explanatory details, in the
example above including that "applications are unaffected" because
sufficient redundancy would still exist; but I'm eliding those here
because I don't have them handy and don't feel like creating such a
situation just to get example output. The important part is that the
disk that dropped off the bus will likely show as UNAVAIL with its
internal numeric identifier and a reference to its WWN thanks to my
naming scheme, instead of as completely missing. The solution is to
get a replacement disk, plug it in, execute "sudo zpool replace tank
$numeric_id $new_device_path", and wait for the resilver to finish,
all while I can still use the system normally.)
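(Spelled out, the replacement workflow is roughly the following
sketch; the numeric ID is the one from the example output above, and
the new device path is a placeholder you'd substitute with your own:

```shell
# Find the failed device's numeric ID in the status output.
zpool status tank

# Rebuild onto the new disk. First argument after the pool name is
# the old device (here the numeric ID ZFS assigned when the disk
# vanished), second is the replacement (placeholder path).
sudo zpool replace tank 8446744073709551616 /dev/mapper/wwn-0xNEW-crypt

# Watch resilver progress; the pool stays online and usable
# throughout, just with reduced redundancy until it completes.
zpool status tank
```

No reboot, no downtime, and the only time pressure is "before
another disk fails".)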

No matter what kind of storage solution you're using - hardware RAID,
software RAID, no redundancy, whichever - or how you're doing backups
(assuming that you are, for some value of "you"), you can't just
ignore issues with it. That way lies data loss.

-- 
Michael Kjörling                     🔗 https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”
