Hi,

I need to build a system that stores data for years, so yes, I am
backing up the data to another mass-storage system from which it can
later be accessed. Any data I successfully back up has to be deleted
from my cluster to make room for new data coming in.

I was aware of snapshotting, which I will use to get the data out of
node2: it creates hard links to the SSTables of a CF, I copy the files
those hard links point to into another location, and then I get rid of
the snapshot (the hard links) and truncate my CFs. It's clear that
snapshotting gives me a single copy of the data when only one node
holds it. What's not clear to me is what happens if I have, say, a
3-node cluster with RF=2 and I snapshot every node and copy those
snapshots to remote storage. Will the remote storage end up with a
single copy of the data, or with twice the data (data + replica)?
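
To make the procedure concrete, here is a minimal sketch of the cycle I
have in mind for one node (Python; the keyspace name, paths and snapshot
tag are made up, and nodetool argument syntax differs across versions,
so check nodetool help on 0.8.6):

    # Sketch of one backup cycle: flush, snapshot (hard links), copy the
    # snapshotted SSTables off-node, then drop the hard links.
    import glob
    import shutil
    import subprocess

    KEYSPACE = "MyKeyspace"               # hypothetical keyspace
    TAG = "backup-20120103"               # snapshot name
    DATA_DIR = "/var/lib/cassandra/data"  # default data file directory
    REMOTE = "/mnt/mass-storage"          # mounted remote storage

    subprocess.check_call(["nodetool", "-h", "localhost", "flush", KEYSPACE])
    subprocess.check_call(["nodetool", "-h", "localhost", "snapshot", TAG])

    # On 0.8.x the snapshot lands under <data_dir>/<keyspace>/snapshots/<tag>/
    for f in glob.glob("%s/%s/snapshots/%s/*" % (DATA_DIR, KEYSPACE, TAG)):
        shutil.copy(f, REMOTE)

    # Only after verifying the copy: drop the hard links, then truncate.
    subprocess.check_call(["nodetool", "-h", "localhost", "clearsnapshot"])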

I've started reading about TTL and I think I can use it, but it's not
clear to me how it would work in conjunction with the
snapshotting/backing up I need to do. It imposes a deadline by which I
have to perform a backup so as not to miss any data, and I might also
duplicate data if some columns don't fully expire between two backups.
Any clarifications on this?
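
In case it helps frame the question, TTL is set per column at write
time; a minimal pycassa sketch (keyspace/CF names are made up):

    # Sketch: write a column that expires 30 days after the write.
    # Expired columns become tombstones that compaction later purges --
    # the "expire data via compaction" Aaron mentions below.
    import pycassa

    pool = pycassa.ConnectionPool("MyKeyspace", ["localhost:9160"])
    cf = pycassa.ColumnFamily(pool, "Events")
    cf.insert("row-key", {"col": "value"}, ttl=30 * 24 * 3600)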

Cheers,
Alex

On Tue, Jan 3, 2012 at 9:44 AM, aaron morton <aa...@thelastpickle.com> wrote:

> That sounds a little complicated.
>
> Do you want to get the data out for an off-node backup, or is it for
> processing in another system?
>
> You may get by using:
>
> * TTL to expire data via compaction
> * snapshots for backups
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 3/01/2012, at 11:00 AM, Alexandru Sicoe wrote:
>
> Hi everyone and Happy New Year!
>
> I need advice on organizing data flow out of my 3-node Cassandra
> 0.8.6 cluster. I am configuring my keyspace to use
> NetworkTopologyStrategy, with two data centers each at replication
> factor 1 (i.e. DC1:1; DC2:1). The PropertyFileSnitch configuration is:
>
> ip_node1=DC1:RAC1
> ip_node2=DC2:RAC1
> ip_node3=DC1:RAC1
>
> I assign tokens like this:
>
>     node1 = 0
>     node2 = 1
>     node3 = 85070591730234615865843651857942052864
>
> My write consistency level is ANY.
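>
> For illustration, a keyspace laid out like this could be created with
> pycassa's SystemManager (the keyspace name and host below are
> placeholders):
>
>     # Sketch: keyspace with NetworkTopologyStrategy, one replica per DC.
>     from pycassa.system_manager import (SystemManager,
>                                         NETWORK_TOPOLOGY_STRATEGY)
>
>     sys_mgr = SystemManager("localhost:9160")
>     sys_mgr.create_keyspace("MyKeyspace",
>                             replication_strategy=NETWORK_TOPOLOGY_STRATEGY,
>                             strategy_options={"DC1": "1", "DC2": "1"})
>     sys_mgr.close()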
>
> My data sources insert data only into node1 & node3. Essentially, a
> replica of every input value ends up on node2, so node2 holds a copy
> of all the data written to the cluster. When node2 starts getting
> full, I want a script that pulls it off-line and runs a sequence of
> operations (compaction/snapshotting/exporting/truncating the CFs) to
> back the data up in a remote place and free the node up so it can take
> more data. When it comes back on-line it will receive hints from the
> other 2 nodes.
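>
> A rough sketch of that off-line cycle as a script (nodetool subcommand
> names vary between versions, so treat this as an outline rather than a
> tested procedure):
>
>     # Sketch: take node2 out of service, back it up, bring it back.
>     import subprocess
>
>     NODE2 = "ip_node2"  # placeholder address
>
>     def nodetool(*args):
>         subprocess.check_call(["nodetool", "-h", NODE2] + list(args))
>
>     nodetool("disablegossip")  # leave the ring; peers start storing hints
>     nodetool("disablethrift")  # stop serving client requests
>     nodetool("flush")          # push memtables out to SSTables
>     # ... compact / snapshot / copy to remote storage / truncate ...
>     nodetool("enablethrift")
>     nodetool("enablegossip")   # rejoin; hints replay to node2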
>
> This is how I plan to ship data out of my cluster without downtime or
> any major performance penalty. The problem is that I also want to
> truncate the CFs on node1 & node3 to free them of data, and I don't
> know whether I can do that without downtime or serious performance
> penalties. Is anyone using truncate to free CFs of data? How efficient
> is it?
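>
> For what it's worth, truncate can be issued from a client, e.g. with
> pycassa (names below are placeholders); as far as I know it requires
> all nodes to be reachable and takes a snapshot of the CF before
> clearing it:
>
>     # Sketch: truncating a CF via pycassa.
>     import pycassa
>
>     pool = pycassa.ConnectionPool("MyKeyspace", ["localhost:9160"])
>     pycassa.ColumnFamily(pool, "Events").truncate()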
>
> Any observations or suggestions are much appreciated!
>
> Cheers,
> Alex