On 22/09/11 20:15, Brian Bockelman wrote:
Hi Gabi,
I'd be a bit scared of that backup strategy; what happens if the TCP connection
gets cut suddenly during curl? What happens if there's a TCP corruption? Such
things have happened before.
Curl might work for long-haul backups, but I'd use HTTPS for its better
checksums, and have alternate in-cluster strategies, such as shared HA
filesystems.
Personally, we have the SNN merge the edits every 15 minutes. If it hasn't
happened in 30 minutes, people get emailed. If it doesn't happen in 45
minutes, people get paged.
That's a good technique for verifying the SNN is actually working.
Thinking it is working when it isn't is dangerous.
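For anyone wanting to copy that escalation, here is a minimal sketch in Python. It assumes the age of the last merge can be read from the modification time of a checkpoint file. The function names and the idea of using the file mtime are my own assumptions, not how Brian's site does it. Only the 30 and 45 minute thresholds come from his description.

```python
import os
import time

# Thresholds from the description above: email at 30 minutes, page at 45.
EMAIL_AFTER_S = 30 * 60
PAGE_AFTER_S = 45 * 60


def checkpoint_age(path, now=None):
    """Seconds since the file at `path` was last modified.

    `path` is a hypothetical location of the merged image; adjust for
    your own layout.
    """
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def escalation_level(age_seconds):
    """Map the age of the last merge to 'ok', 'email' or 'page'."""
    if age_seconds >= PAGE_AFTER_S:
        return "page"
    if age_seconds >= EMAIL_AFTER_S:
        return "email"
    return "ok"
```

Run from cron every few minutes, something like escalation_level(checkpoint_age(path)) gives the monitoring system a single value to act on.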
In addition to writing out copies to a few disks and to NFS, we also have a
versioned backup of the checkpoint.prev.
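A versioned copy along those lines could look roughly like the sketch below. It assumes the thing being backed up is a single file and that a timestamp suffix is an acceptable versioning scheme. The function names, the suffix format and the SHA-256 check are illustrative choices of mine, not a description of the setup above.

```python
import hashlib
import os
import shutil
import time


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def versioned_backup(src, backup_dir, stamp=None):
    """Copy `src` into `backup_dir` under a timestamped name and verify it.

    Raises IOError and removes the copy if the digests do not match.
    """
    stamp = stamp or time.strftime("%Y%m%dT%H%M%S")
    os.makedirs(backup_dir, exist_ok=True)
    dest = os.path.join(backup_dir, f"{os.path.basename(src)}.{stamp}")
    shutil.copy2(src, dest)
    if sha256_of(src) != sha256_of(dest):
        os.remove(dest)
        raise IOError(f"checksum mismatch copying {src} to {dest}")
    return dest
```

Comparing digests after the copy only catches a bad copy; it does nothing for a source image that was already corrupt, which is the case discussed next.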
The worst case scenario would be if the SNN corrupts the image and uploads the
corrupt image (it's a theoretical situation so far...); this would be caught at
the next merge, meaning we trash up to 30 minutes of work. This would ruin
someone's day, but not someone's week.
The NN is a SPOF, and should be treated with an appropriate level of paranoia
(and, because it is a SPOF, assume that it will fail anyway and make sure you
can accept the consequences).
That is: test your handling of the outage on a regular basis.