Filesystem redundancy

Julian Smith Wed, 16 Nov 2005 03:17:58 -0800

I've been wondering about how to cope with random hardware failures when
data is being received from a WAN and written to local storage. As I
understand it, CARP(4) will enable any one of N machines to handle
incoming requests, so hardware failure of up to N-1 machines will be
handled.


But if each of these machines writes received data (e.g. emails) to a
shared hard drive, then we are back to a single point of failure (if
each machine writes to its own individual hard drive(s) then we end up
with no sharing of data). We can make the drive use RAID, but RAID
controllers can also fail.

One way of handling this would be to write a filesystem that copies the
contents of modified files over a network before close() returns. That
way, as long as an SMTP server (say) checks the return value from close()
before telling the sender that it has received everything ok, we can
avoid any single point of failure. If the data is copied to all the
other machines in a CARP `family', then we should end up with perfectly
synchronised machines, each of which can take over at any time. The
obvious downside is potential speed problems.
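
To make the ordering concrete, here is a minimal sketch in Python (not the
dupfs code itself) of "acknowledge only after every copy is written". The
function names are hypothetical, and the replicas are modelled as local
directories standing in for the other machines in the CARP group; a real
version would copy over the network instead.

```python
import os


def write_durably(path, data):
    """Write data and push it to disk; raises OSError on any failure."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Leaving the 'with' block closes the file; a close() error also raises.


def store_then_ack(name, data, primary_dir, replica_dirs):
    """Return an SMTP-style reply only once every copy has been stored."""
    try:
        write_durably(os.path.join(primary_dir, name), data)
        for replica in replica_dirs:
            # Assumption: a local directory stands in for a remote machine.
            write_durably(os.path.join(replica, name), data)
    except OSError:
        # Any failed copy means the sender is told to retry later.
        return "451 Requested action aborted: local error in processing"
    return "250 OK"
```

The sender only sees "250 OK" if no write, fsync or close failed on any
copy, which is the property the close() check is meant to give.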

This has the nice property that the unit of replacement is individual
machines, with no need for complicated and expensive hardware like
Network Attached Storage/RAID. If something fails, install a fresh
machine, sync its hard drives a few times with one of the other machines
(whose contents will be changing due to incoming data from the WAN),
temporarily turn off the WAN, sync a final time, and restore the WAN.
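
The replacement procedure above can be sketched as follows. This is only an
outline of the ordering: sync, pause_wan and resume_wan are hypothetical
callables supplied by the caller (for example a wrapper around rsync and
whatever disables incoming traffic), and the default number of warm-up
passes is an arbitrary assumption.

```python
def bring_up_replacement(sync, pause_wan, resume_wan, warm_passes=3):
    """Bring a fresh machine in line with a live one."""
    # Warm-up passes run while the source is still receiving data, so each
    # pass leaves a smaller difference for the next one to copy.
    for _ in range(warm_passes):
        sync()

    # With the WAN off nothing changes on the source, so one last pass
    # leaves the two machines identical.
    pause_wan()
    try:
        sync()
    finally:
        # Restore incoming traffic even if the final sync failed.
        resume_wan()
```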

I've written a simple test library, dupfs, that does this by intercepting
open() and close() with LD_PRELOAD, using system("rsync ...") to do the
synchronisation, and it works in trivial test cases. I think any
simple-minded file locking by dupfs would lead to deadlock, so something
else (CARP?) would have to ensure that only one of a number of machines
was active at any time.
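
For illustration, the same hook-on-close idea looks roughly like this in
Python. This is an analogue, not the dupfs source: dupfs is a C library
loaded with LD_PRELOAD, whereas this wraps a file object directly. The
class name, the peer host names and the choice of "rsync -a" are all
assumptions; the command runner is injectable so the sketch can be
exercised without rsync or a network.

```python
import subprocess


def rsync_argv(path, peer):
    """Build the command that copies one file to the same path on a peer."""
    return ["rsync", "-a", path, f"{peer}:{path}"]


class ReplicatedFile:
    """File wrapper whose close() copies a modified file to every peer."""

    def __init__(self, path, mode, peers, run=None):
        self._path = path
        # Only files opened in a mode that can modify them need copying.
        self._writable = any(c in mode for c in "wax+")
        self._peers = list(peers)
        # Default runner raises CalledProcessError if rsync exits non-zero.
        self._run = run or (lambda argv: subprocess.run(argv, check=True))
        self._file = open(path, mode)

    def write(self, data):
        return self._file.write(data)

    def close(self):
        self._file.close()
        if self._writable:
            for peer in self._peers:
                # close() does not return until each copy has finished.
                self._run(rsync_argv(self._path, peer))
```

A caller that treats an exception from close() as "not stored" gets the
same guarantee as a C program checking the return value of close().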

I expected there to be standard solutions to this sort of problem, but I
was unable to find anything which didn't involve expensive hardware.
ISPs seem to accept that they will suffer downtime due to hardware
failure, and occasionally lose emails.

So, am I barking up the wrong tree here? What am I missing?

- Julian

-- 
http://www.op59.net/
