We built an 8-node cluster using Ignite persistence with 1 backup, and had two nodes fail at different times: the first because its storage did not get mounted and it ran out of space early, the second because an SSD failed. There are some things that we could have done better, but this event raises the question of how backups are distributed.
There are two approaches that behave substantially differently on double faults, and double faults become more likely at scale:

1) Random placement of backup partitions relative to the primary.
2) Backup partitions have similar affinity to the primary partitions. In the extreme, nodes are paired so that the primaries on one node of the pair have their backups on the other node of the pair.

With a 64-node cluster, #2 would have 1/63 of the likelihood of data loss of #1 when 2 nodes fail, since only one of the 63 possible partners of a failed node holds its backups. I'm guessing that Ignite ships with #1, but could we provide our own affinity function to accomplish #2 if we chose?
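
To make the 1/63 figure concrete, here is a rough sketch of the arithmetic in Python. It is only a toy model, not Ignite's actual affinity logic: the function names are hypothetical, "random placement" is assumed to mean a uniformly chosen (primary, backup) node pair per partition, and the partition count is a free parameter. It enumerates every possible 2-node failure and counts how many would take out both copies of some partition.

```python
import random
from itertools import combinations


def loss_probability_paired(nodes: int) -> float:
    """Fraction of 2-node failures that lose data when nodes are paired.

    Assumption: node 0 backs up node 1 and vice versa, 2 with 3, and so on.
    """
    pairs = {(i, i + 1) for i in range(0, nodes, 2)}
    failures = list(combinations(range(nodes), 2))
    return sum(1 for f in failures if f in pairs) / len(failures)


def loss_probability_random(nodes: int, partitions: int, seed: int = 0) -> float:
    """Fraction of 2-node failures that lose data under random placement.

    Assumption: each partition gets two distinct nodes chosen uniformly
    at random to hold its primary and its single backup.
    """
    rng = random.Random(seed)
    placements = {
        frozenset(rng.sample(range(nodes), 2)) for _ in range(partitions)
    }
    failures = list(combinations(range(nodes), 2))
    return sum(1 for f in failures if frozenset(f) in placements) / len(failures)


if __name__ == "__main__":
    print("paired:", loss_probability_paired(64))
    for partitions in (1_000, 10_000, 100_000):
        print("random,", partitions, "partitions:",
              loss_probability_random(64, partitions))
```

In the paired case, 32 of the 2016 possible node pairs are fatal, which is exactly 1/63. In the random case the result depends on how many partitions there are in total across all caches: the full 63x gap assumes enough partitions that essentially every node pair shares at least one, and the gap is smaller when there are fewer.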
