> 1. Is it feasible to run directly against a Cassandra data directory
> restored from an EBS snapshot? (as opposed to nodetool snapshots restored
> from an EBS snapshot).
Assuming EBS is not buggy - including honoring write barriers, including the
Linux guest kernel, etc. - then yes. EBS snapshots of a single volume are
promised to be atomic. As such, a restore from an EBS snapshot should be
semantically identical to recovering after a power outage or sudden reboot of
the node. I make no claims as to how well EBS snapshot atomicity is actually
tested in practice.

> 2. Noting the wiki's consistent Cassandra backups advice; if I schedule
> nodetool snapshots across the cluster, should the relative age of the
> 'sibling' snapshots be a concern? How far apart can they be before it's a
> problem? (seconds? minutes? hours?)

The only strict requirement from Cassandra's point of view, that I can think
of, is the tombstone problem. It is the same as for a node going offline for
an extended period: if GC grace times are exceeded, then bringing a node back
up can cause data that was deleted to re-appear in the cluster. The same is
true when restoring a node from an EBS snapshot (essentially equivalent to
the node being down for a while).

Once you have satisfied that requirement, the remaining concern is mostly
that of your application - i.e., to what extent it is acceptable for your
application that the cluster contains data representing different points in
time. Remember that any data not on the same row key will essentially have
its own "timeline" with respect to backup/restore, since different rows are
never guaranteed to be contained on overlapping nodes in the cluster.

Also be aware that while per-node restores from EBS snapshots are probably a
pretty good catastrophic failure recovery technique, a "total loss and
restore" event will have an impact on consistency beyond just going back in
time - unless you can strictly co-ordinate a fully synchronized snapshot
across all nodes in the cluster (not really feasible on EC2 without extensive
mucking about in userland and temporarily bringing down the cluster). For
example, if you do one QUORUM write to row key A followed by a QUORUM write
to row key B, and you rely on referential integrity, with the data in B
referring to the data in A, that integrity can be broken after a
non-globally-consistent restore. Whether that is a problem will be entirely
up to your application.

In any case, after a restore from snapshots, you'll want to run rolling
'nodetool repair's to make sure all data is replicated as soon as possible,
to the greatest extent possible. At least, again, if your application
benefits from this. The only hard requirement is the repair schedule relative
to GC grace time, and that requirement does not change - just be mindful of
the timing of the EBS snapshots and what that means for your repair schedule.
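For what it's worth, a minimal sketch of what I mean by "rolling" repairs.
The hostnames and port below are made up and I haven't tested this, so treat
it purely as an illustration of serializing 'nodetool repair' across nodes
rather than as a recipe:

    #!/usr/bin/env python
    #
    # Rough sketch only (hypothetical hostnames/port): run 'nodetool repair'
    # one node at a time - i.e., a "rolling" repair - so the cluster is never
    # repairing on more than one node at once. Assumes nodetool can reach
    # each node's JMX port from wherever this script runs.

    import subprocess

    NODES = ["cass1.example.com", "cass2.example.com", "cass3.example.com"]
    JMX_PORT = "7199"  # adjust to whatever JMX port your nodes actually expose

    for node in NODES:
        print("starting repair on %s" % node)
        # check_call blocks until this node's repair finishes (and raises on
        # failure), which is what keeps the repairs rolling rather than
        # running concurrently.
        subprocess.check_call(["nodetool", "-h", node, "-p", JMX_PORT, "repair"])
        print("finished repair on %s" % node)

The important part is simply that the repairs run one node at a time; how you
actually drive them (cron, an ssh loop, whatever) doesn't matter much.

-- 
/ Peter Schuller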