Artur,

Replies inline.

On Fri, May 2, 2014 at 10:42 AM, Artur Kronenberg <
artur.kronenb...@openmarket.com> wrote:

> we are running a 7 node cluster with an RF of 5. Each node holds about 70%
> of the data and we are now wondering about the backup process.
>

What are you using for a backup process at the moment? Or, for that matter,
what does your application stack look like? If you're running on Amazon AWS
it is simple to get started with a project like tablesnap
<https://github.com/JeremyGrosser/tablesnap>, which listens for new sstables
and uploads them to S3.

You can also take snapshots of the data on each node with 'nodetool
snapshot', and move the data manually.
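If it helps, here is a minimal sketch of that manual approach. The snapshot
tag, the keyspace name ('mykeyspace'), the backup host and the default
/var/lib/cassandra data directory are all placeholders for your own setup:

    # on each node: flush memtables and hard-link the current sstables
    # under a named snapshot tag
    nodetool snapshot -t backup-20140502 mykeyspace

    # copy each table's snapshot directory off the node, preserving paths
    rsync -aR /var/lib/cassandra/data/mykeyspace/*/snapshots/backup-20140502 \
        backuphost:/backups/$(hostname -s)/

Keeping each node's data under its own directory on the backup host makes it
much easier to restore or bulk load later.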


>  1. Is there a best practice procedure or a tool that we can use to have
> one backup that holds 100 % of the data or is it necessary for us to take
> multiple backups.
>

A backup of a distributed system generally refers to capturing the state of
the database at a particular point in time. The size and spread of your data
will be the limiting factor in having one backup: you can store the data from
every node on a single machine, but you won't be able to combine it into a
single node without some extra legwork (see the note on sstableloader below).


> 2. If we have to use multiple backups, is there a way to combine them? We
> would like to be able to start up a 1 node cluster that holds 100% of data
> if necessary. Can we just chug all sstables into the data directory and
> cassandra will figure out the rest?
>


> 4. If all of the above would work, could we in case of emergency setup a
> massive 1-node cluster that holds 100 % of the data and repair the rest of
> our cluster based of this? E.g. have the 1 node run with the correct data,
> and then hook it into our existing cluster and call repair on it to restore
> data on the rest of our nodes?
>

You could bulk load the sstable data into a smaller cluster using the
'sstableloader' tool. I gave a webinar
<https://www.youtube.com/watch?v=00weETpk3Yo> for Planet Cassandra a few
months ago about how to backfill data into your cluster, which could help
here.
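As a rough sketch of how the tool is invoked (the node addresses and paths
are placeholders; the directory you point it at needs to end in
<keyspace>/<table> so sstableloader knows where to stream the data):

    # stream the sstables in this directory to the cluster reachable at the
    # given node addresses; the data is re-replicated according to the schema
    sstableloader -d 10.0.0.1,10.0.0.2 /backups/node1/mykeyspace/mytable/

Check 'sstableloader --help' on your version for the exact options.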

> 3. How do we handle the commitlog files from all of our nodes? Given we'd
> like to restore to a certain point in time and we have all the commitlogs,
> can we have commitlogs from multiple locations in the commitlog folder and
> cassandra will pick and execute the right thing?
>

You'll want to run 'nodetool drain'
<http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsDrain.html>
on each node before taking the backup to avoid this issue. Drain makes the
node unavailable for writes and flushes the memtables to sstables on disk, so
there is nothing left in the commitlog that needs to be replayed on restore.
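A hedged sketch of the per-node ordering (you'll need to restart the
Cassandra process afterwards to accept writes again; the snapshot tag is just
a placeholder):

    # stop accepting writes and flush all memtables to sstables on disk
    nodetool drain

    # snapshot the now-complete on-disk state
    nodetool snapshot -t drained-20140502

    # ... copy the snapshot directories off the node, then restart the
    # cassandra service to bring the node back into the cluster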

Cheers,
-- 
Patricia Gorla
@patriciagorla

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
