I've been wondering about how to cope with random hardware failures when data is being received from a WAN and written to local storage. As I understand it, CARP(4) will enable any one of N machines to handle incoming requests, so hardware failure of up to N-1 machines will be handled.
But if each of these machines writes received data (e.g. emails) to a shared hard drive, then we are back to a single point of failure. (If each machine writes to its own individual hard drive(s) instead, then we end up with no sharing of data.) We can make the shared drive use RAID, but RAID controllers can also fail.

One way of handling this would be to write a filesystem that copies the contents of modified files over the network before close() returns. That way, as long as an SMTP server (say) checks the return value of close() before telling the sender that it has received everything OK, we can avoid any single point of failure. If the data is copied to all the other machines in a CARP `family', then we should end up with perfectly synchronised machines, each of which can take over at any time. The obvious downside is potential speed problems.

This scheme has the nice property that the unit of replacement is an individual machine, with no need for complicated and expensive hardware like Network Attached Storage/RAID. If something fails, install a fresh machine, sync its hard drives a few times with one of the other machines (whose contents will be changing due to incoming data from the WAN), temporarily turn off the WAN, sync a final time, and restore the WAN.

I've written a simple test library, dupfs, that does this by intercepting open() and close() with LD_PRELOAD and using system( "rsync ...") to do the synchronisation. It works in trivial test cases. Any simple-minded file locking by dupfs would lead to deadlock, I think, so something else (CARP?) would have to ensure that only one of the machines is active at any time.

I expected there to be standard solutions to this sort of problem, but I was unable to find anything that didn't involve expensive hardware. ISPs seem to accept that they will suffer downtime due to hardware failure, and occasionally lose emails.

So, am I barking up the wrong tree here? What am I missing?

- Julian

--
http://www.op59.net/
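
As a rough illustration of the "don't acknowledge until every copy exists" ordering described above, here is a minimal Python sketch. It is not dupfs itself: local directories stand in for the other machines in the CARP family, a plain file copy stands in for rsync, and the names replicated_write and ReplicationError are made up for this example.

```python
import os
import shutil


class ReplicationError(OSError):
    """Raised when a copy to one of the replicas fails (hypothetical name)."""


def replicated_write(path, data, replica_dirs):
    """Write data to path, then copy it to every replica before returning.

    replica_dirs are local directories standing in for the other machines;
    a real setup would push the file over the network instead.
    """
    # Write and flush the local copy first.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Only return success once every replica holds a complete copy.
    for d in replica_dirs:
        target = os.path.join(d, os.path.basename(path))
        tmp = target + ".part"
        try:
            shutil.copyfile(path, tmp)
            os.replace(tmp, target)  # a replica never exposes a partial file
        except OSError as exc:
            raise ReplicationError(f"replica {d} failed: {exc}") from exc
```

A caller such as a mail server would tell the sender "received OK" only if replicated_write() returns without raising. If any replica cannot be written, the exception corresponds to close() returning an error in the scheme above.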