Hi Pankaj,

Not wishing to put words in Linux-Fan's mouth, but my own views are…
On Mon, Jan 25, 2021 at 11:04:09AM +0530, Pankaj Jangid wrote:
> Linux-Fan <ma_sys...@web.de> writes:
>
> > * OS data bitrot is not covered, but OS single HDD failure is.
> > I achieve this by having OS and Swap on MDADM RAID 1
> > i.e. mirrored but without ZFS.
>
> I am still learning.
>
> 1. By "by having OS and Swap on MDADM", did you mean the /boot partition
> and swap.

When people say "I put OS and Swap on MDADM", they typically mean the
entire installed system, before any user/service data is put on it. So
that's / and all its usual sub-directories, and swap, possibly with some
things split off into their own filesystems after install. There's a
sketch of such a setup in the P.S. below.

> 2. Why did you put Swap on RAID? What is the advantage?

If swap is in use and the device behind it goes away, your system will
likely crash. The point of RAID is to increase availability, so if you
have the OS itself in RAID and you have swap, the swap should be in RAID
too.

There are use cases where the software itself provides the availability.
For example, there is Ceph, which typically uses simple block devices
from multiple hosts and distributes the data around. A valid setup for
Ceph is to have the OS in a small RAID just so that a device failure
doesn't take down a machine entirely, but then have the data devices
stand alone, as Ceph itself will handle a failure of those. Small
boot+OS devices are cheap and it's so simple to RAID them.

Normally Ceph is set up so that an entire host can be lost. If host
reinstallation is automatic and quick, and there are so many hosts that
losing any one of them is a fairly minor occurrence, then it could even
be valid not to put the OS+swap in RAID. To me that still sounds like a
lot more hassle than just replacing a dead drive in a running machine
(see the last sketch in the P.S.), so I wouldn't do it personally.

> - I understood that RAID is used to detect disk failures early.

Not really. With RAID or ZFS or the like it is typical to have a
periodic (weekly, monthly, etc.) scrub that reads all data and may
uncover drive problems like unreadable sectors, but usually failures
simply happen when they happen. The difference RAID makes is that a copy
of the data still exists somewhere else, so that copy can be used and
the failure does not have to propagate up to the application. (The P.S.
below shows how to run a scrub by hand.)

> How do you decide which partition to cover and which not?

For each of the storage devices in your system, ask yourself:

- Would your system still run if that device suddenly went away?

- Would your application(s) still run if that device suddenly went away?

- Could finding a replacement device and restoring your data from
  backups be done in a time span that you consider reasonable?

If the answers to those questions are not something you could tolerate,
add some redundancy in order to reduce the unavailability. If you decide
you can tolerate the possible unavailability, then so be it.

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting
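P.S. A couple of the points above are easier to see with commands in
front of you. First, a minimal sketch of a mirrored OS+swap setup with
mdadm, assuming two disks already partitioned identically; the device
names (/dev/sda2, /dev/sdb2 and so on) are made up, so adjust them to
your own layout:

  # Mirror for / (the Debian installer can set this up for you;
  # shown by hand here, with example partition names)
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
  mkfs.ext4 /dev/md0

  # A second mirror for swap, so a dead disk can't take out paged memory
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
  mkswap /dev/md1
  swapon /dev/md1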
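Second, the periodic scrub. On Debian the mdadm package ships a cron job
(/usr/share/mdadm/checkarray) that does this monthly; you can also
trigger one yourself through sysfs:

  # Ask the kernel to read and compare every sector of the mirror
  echo check > /sys/block/md0/md/sync_action

  # Watch progress, and check for mismatches afterwards
  cat /proc/mdstat
  cat /sys/block/md0/md/mismatch_cnt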
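Finally, the "replacing a dead drive in a running machine" part, which
is about as exciting as it gets (again, the device names here are only
examples):

  # Mark the failed disk out of the array and remove it
  mdadm --manage /dev/md0 --fail /dev/sda2
  mdadm --manage /dev/md0 --remove /dev/sda2

  # After physically swapping the disk and recreating the partition,
  # add it back; the mirror resyncs while the system keeps running
  mdadm --manage /dev/md0 --add /dev/sda2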