Re: cassandra backup

Michael Theroux Fri, 06 Dec 2013 05:41:19 -0800

Hi Marcelo,

Cassandra provides an eventually consistent model for backups.  You can do 
staggered backups of data, with the idea that if you restore a node, and then 
do a repair, your data will be once again consistent.  Cassandra will not 
automatically copy the data to other nodes (other than via hinted handoff).  
You should manually run repair after restoring a node.
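
A rough sketch of that restore-then-repair step in Python. The function name, 
its arguments, and the <data_dir>/<keyspace>/<table> layout are assumptions for 
illustration (newer versions add a suffix to the table directory); only the 
`nodetool refresh` and `nodetool repair` commands are real:

```python
import shutil
import subprocess
from pathlib import Path


def restore_table(backup_dir, data_dir, keyspace, table, run=subprocess.check_call):
    """Copy backed-up SSTable files into a table directory, load them, then repair.

    Hypothetical helper: assumes the table directory is <data_dir>/<keyspace>/<table>.
    `run` is injectable so the command sequence can be checked without a cluster.
    """
    table_dir = Path(data_dir, keyspace, table)
    table_dir.mkdir(parents=True, exist_ok=True)
    restored = []
    for f in sorted(Path(backup_dir).iterdir()):
        if f.is_file():
            shutil.copy2(f, table_dir / f.name)
            restored.append(f.name)
    # Ask the running node to pick up the newly placed SSTables.
    run(["nodetool", "refresh", keyspace, table])
    # The restored data may be older than what the other replicas hold;
    # repair reconciles the node with them.
    run(["nodetool", "repair", keyspace])
    return restored
```

If the node is stopped while the files are copied back, the refresh step is 
unnecessary, but the repair afterwards still is.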
  
You should take snapshots when doing a backup, as it keeps the data you are 
backing up relevant to a single point in time; otherwise compaction could 
add/delete files on you mid-backup, or worse, I imagine, you could attempt to 
access an SSTable mid-write.  Snapshots work by using hard links, and don't take 
additional storage to create.  In our process we create the snapshot, perform 
the backup, and then clear the snapshot.
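
The create/backup/clear cycle could be scripted roughly like this. It is a 
sketch, not our actual tooling: the function name, the destination layout, and 
the injectable `run` argument are made up for illustration, and it assumes 
snapshots land under <data_dir>/<keyspace>/<table>/snapshots/<tag>:

```python
import shutil
import subprocess
from pathlib import Path


def backup_keyspace(keyspace, tag, data_dir, dest_dir, run=subprocess.check_call):
    """Snapshot a keyspace, copy the snapshot files out, then clear the snapshot.

    Hypothetical helper; `dest_dir` stands in for wherever the backup goes.
    """
    # Hard-link the current SSTables into .../snapshots/<tag>/ for every table.
    run(["nodetool", "snapshot", "-t", tag, keyspace])
    copied = []
    try:
        # Assumed layout: one snapshot directory per table.
        for snap in sorted(Path(data_dir, keyspace).glob(f"*/snapshots/{tag}")):
            table = snap.parent.parent.name
            target = Path(dest_dir, keyspace, table)
            shutil.copytree(snap, target)
            copied.append(target)
    finally:
        # Always clear the snapshot, even if the copy fails, so the hard links
        # do not keep compacted-away SSTables on disk.
        run(["nodetool", "clearsnapshot", "-t", tag, keyspace])
    return copied
```

The copy step is where an rsync or an S3 upload would go instead of 
`shutil.copytree`.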

One thing to keep in mind in your S3 cost analysis is that, even though storage
is cheap, reads/writes to S3 are not (especially writes). If you are using
LeveledCompaction, or otherwise have a ton of SSTables, some people have
encountered increased costs moving the data to S3.
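
To get a feel for the request side of the bill, something like the following 
back-of-the-envelope estimate may help. It assumes one PUT request per file 
(multipart uploads of large files would add more), and the price is a parameter 
you supply from the current S3 pricing page, not a number I am quoting:

```python
from pathlib import Path


def estimate_put_cost(snapshot_dir, price_per_1000_puts):
    """Return (file_count, estimated request cost) for uploading a directory.

    Hypothetical helper: assumes exactly one PUT request per file.
    """
    n_files = sum(1 for p in Path(snapshot_dir).rglob("*") if p.is_file())
    return n_files, n_files * price_per_1000_puts / 1000.0
```

Multiply the result by the number of nodes and the number of backups per month 
to compare it against the storage cost.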

Ourselves, we maintain backup EBS volumes that we regularly snapshot/rsync data
to. Thus far this has worked very well for us.

-Mike

On Friday, December 6, 2013 8:14 AM, Marcelo Elias Del Valle
<marc...@s1mbi0se.com.br> wrote:

Hello everyone,

I am trying to create backups of my data on AWS. My goal is to store the
backups on S3 or glacier, as it's cheap to store this kind of data. So, if I
have a cluster with N nodes, I would like to copy data from all N nodes to S3
and be able to restore later. I know Priam does that (we were using it), but I
am using the latest cassandra version and we plan to use DSE some time, I am
not sure Priam fits this case.
I took a look at the docs:
http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/operations/../../cassandra/operations/ops_backup_takes_snapshot_t.html
And I am trying to understand if it's really needed to take a snapshot to
create my backup. Suppose I do a flush and copy the sstables from each node, one
by one, to s3. Not all at the same time, but one by one.
When I try to restore my backup, data from node 1 will be older than data
from node 2. Will this cause problems? AFAIK, if I am using a replication
factor of 2, for instance, and Cassandra sees data from node X only, it will
automatically copy it to other nodes, right? Is there any chance of cassandra
nodes becoming corrupt somehow if I do my backups this way?

Best regards,
Marcelo Valle.
