Re: corrupted edits log after power failure

Steve Loughran Mon, 26 Sep 2011 07:35:10 -0700

On 22/09/11 20:15, Brian Bockelman wrote:

Hi Gabi,


I'd be a bit scared of that backup strategy; what happens if the TCP connection 
gets cut suddenly during curl?  What happens if there's a TCP corruption?  Such 
things have happened before.

Curl might work for long-haul backups, but I'd use HTTPS for its better 
checksums, and have alternate in-cluster strategies, such as shared HA 
filesystems.
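
As a rough illustration of guarding against a cut or corrupted transfer, here is a minimal Python sketch. It assumes the expected size is known (for example from the HTTP Content-Length header) and, optionally, that a SHA-256 of the image is available from somewhere trusted; the function name store_verified and both of those inputs are hypothetical, not anything Hadoop provides. The copy goes to a temporary file first and only replaces the previous backup once it has been verified.

```python
import hashlib
import os
import tempfile


def store_verified(chunks, dest_path, expected_size, expected_sha256=None):
    """Write chunks to dest_path only if the size (and checksum) match.

    chunks is any iterable of bytes, e.g. the body of an HTTP response.
    expected_size and expected_sha256 are assumptions of this sketch.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Temporary file in the same directory so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    digest = hashlib.sha256()
    size = 0
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
                digest.update(chunk)
                size += len(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        if size != expected_size:
            raise IOError("short or long transfer: got %d bytes, expected %d"
                          % (size, expected_size))
        if expected_sha256 is not None and digest.hexdigest() != expected_sha256:
            raise IOError("checksum mismatch")
        # Only now does the new copy replace the previous good backup.
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A truncated download then raises an error and leaves the last good copy in place, rather than silently overwriting it.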


Personally, we have the SNN merge the edits every 15 minutes.  If it hasn't 
happened in 30 minutes, people get emailed.  If it doesn't happen in 45 
minutes, people get paged.

That's a good technique for verifying the SNN is actually working. Thinking it 
is working when it isn't is dangerous.
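
A minimal Python sketch of that kind of staleness check, using the 30 and 45 minute thresholds quoted above. It assumes the age of the last merge can be read from the modification time of the checkpoint image on disk; that assumption, the function names, and the "ok"/"email"/"page" labels are illustrative only, and hooking the result up to mail or a pager is left to whatever monitoring is already in place.

```python
import os
import time

# Thresholds from the message above: merges are expected every 15 minutes,
# people get emailed at 30 minutes and paged at 45.
EMAIL_AFTER_SECONDS = 30 * 60
PAGE_AFTER_SECONDS = 45 * 60


def classify_checkpoint_age(age_seconds):
    """Map the age of the last successful merge to an alert level."""
    if age_seconds >= PAGE_AFTER_SECONDS:
        return "page"
    if age_seconds >= EMAIL_AFTER_SECONDS:
        return "email"
    return "ok"


def checkpoint_age_seconds(image_path, now=None):
    """Age of the checkpoint image, based on its mtime (an assumption)."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(image_path)
```

Run from cron every few minutes against the image path, this gives an independent signal that does not rely on the SNN reporting its own health.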

In addition to writing out copies to a few disks and to NFS, we also have a 
versioned backup of the checkpoint.prev.

The worst case scenario would be if the SNN corrupts the image and uploads the 
corrupt image (it's a theoretical situation so far...); this would be caught at 
the next merge, meaning we trash up to 30 minutes of work.  This would ruin 
someone's day, but not someone's week.

The NN is a SPOF, and should be treated with an appropriate level of paranoia 
(and, because it is a SPOF, assume that it will fail anyway and make sure you 
can accept the consequences).


That is: test your handling of the outage on a regular basis.
