Hi Pankaj,

Not wishing to put words in Linux-Fan's mouth, but my own views are…
On Mon, Jan 25, 2021 at 11:04:09AM +0530, Pankaj Jangid wrote:
> Linux-Fan <ma_sys...@web.de> writes:
>
> > * OS data bitrot is not covered, but OS single HDD failure is.
> > I achieve this by having OS and Swap on MDADM RAID 1
> > i.e. mirrored but without ZFS.
>
> I am still learning.
>
> 1. By "by having OS and Swap on MDADM", did you mean the /boot partition
> and swap.

When people say "I put OS and Swap on MDADM", they typically mean the
entire installed system, before any user/service data is put on it. So
that's / and all its usual sub-directories, and swap, possibly with some
things split off into their own filesystems after install. There's a
sketch of such a setup in the P.S. below.

> 2. Why did you put Swap on RAID? What is the advantage?

If swap is in use and the device behind it goes away, your system will
likely crash. The point of RAID is to increase availability, so if you
have the OS itself in RAID and you have swap, the swap should be in RAID
too.

There are use cases where the software itself provides the availability.
For example, there is Ceph, which typically uses simple block devices
from multiple hosts and distributes the data around. A valid setup for
Ceph is to have the OS in a small RAID just so that a device failure
doesn't take down a machine entirely, but then have the data devices
stand alone, as Ceph itself will handle a failure of those. Small
boot+OS devices are cheap and it's so simple to RAID them.

Normally Ceph is set up so that an entire host can be lost. If host
reinstallation is automatic and quick, and there are so many hosts that
losing any one of them is a fairly minor occurrence, then it could even
be valid not to put the OS+swap in RAID. To me that still sounds like a
lot more hassle than just replacing a dead drive in a running machine
(see the last sketch in the P.S.), so I wouldn't do it personally.

> - I understood that RAID is used to detect disk failures early.

Not really. With RAID or ZFS or the like it is typical to have a
periodic (weekly, monthly, etc.) scrub that reads all data and may
uncover drive problems like unreadable sectors, but usually failures
simply happen when they happen. The difference RAID makes is that a copy
of the data still exists somewhere else, so that copy can be used and
the failure does not have to propagate up to the application. (The P.S.
below shows how to run a scrub by hand.)

> How do you decide which partition to cover and which not?

For each of the storage devices in your system, ask yourself:

- Would your system still run if that device suddenly went away?

- Would your application(s) still run if that device suddenly went away?

- Could finding a replacement device and restoring your data from
  backups be done in a time span that you consider reasonable?

If the answers to those questions are not something you could tolerate,
add some redundancy in order to reduce the unavailability. If you decide
you can tolerate the possible unavailability, then so be it.

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting
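P.S. A couple of the points above are easier to see with commands in
front of you. First, a minimal sketch of a mirrored OS+swap setup with
mdadm, assuming two disks already partitioned identically; the device
names (/dev/sda2, /dev/sdb2 and so on) are made up, so adjust them to
your own layout:

  # Mirror for / (the Debian installer can set this up for you;
  # shown by hand here, with example partition names)
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
  mkfs.ext4 /dev/md0

  # A second mirror for swap, so a dead disk can't take out paged memory
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
  mkswap /dev/md1
  swapon /dev/md1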
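Second, the periodic scrub. On Debian the mdadm package ships a cron job
(/usr/share/mdadm/checkarray) that does this monthly; you can also
trigger one yourself through sysfs:

  # Ask the kernel to read and compare every sector of the mirror
  echo check > /sys/block/md0/md/sync_action

  # Watch progress, and check for mismatches afterwards
  cat /proc/mdstat
  cat /sys/block/md0/md/mismatch_cnt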
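Finally, the "replacing a dead drive in a running machine" part, which
is about as exciting as it gets (again, the device names here are only
examples):

  # Mark the failed disk out of the array and remove it
  mdadm --manage /dev/md0 --fail /dev/sda2
  mdadm --manage /dev/md0 --remove /dev/sda2

  # After physically swapping the disk and recreating the partition,
  # add it back; the mirror resyncs while the system keeps running
  mdadm --manage /dev/md0 --add /dev/sda2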