Hi, I need to build a system that stores data for years, so yes, I am backing the data up in another mass storage system from where it can later be accessed. The data that I successfully back up has to be deleted from my cluster to make space for new data coming in.
I was aware of snapshotting, which I will use for getting the data out of node2: it creates hard links to the SSTables of a CF, then I can copy the files pointed to by the hard links to another location. After that I get rid of the snapshot (the hard links) and then I can truncate my CFs. (A rough sketch of the per-node script I have in mind is at the bottom of this mail, below the quoted thread.)

It's clear that snapshotting will give me a single copy of the data when there is only one copy of the data on one node. What is not clear to me is what happens if I have, say, a cluster with 3 nodes and RF=2, and I take a snapshot on every node and copy those snapshots to remote storage. Will I get a single copy of the data in the remote storage, or will it be twice the data (data + replica)?

I've started reading about TTL and I think I can use it, but it's not clear to me how it would work in conjunction with the snapshotting/backing up I need to do. As I understand it, it imposes a deadline by which I need to perform a backup in order not to miss any data. Also, I might end up with duplicate data if some columns don't fully expire between two backups. Any clarifications on this?

Cheers,
Alex

On Tue, Jan 3, 2012 at 9:44 AM, aaron morton <aa...@thelastpickle.com> wrote:

> That sounds a little complicated.
>
> Do you want to get the data out for an off-node backup, or is it for
> processing in another system?
>
> You may get by using:
>
> * TTL to expire data via compaction
> * snapshots for backups
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 3/01/2012, at 11:00 AM, Alexandru Sicoe wrote:
>
> Hi everyone and Happy New Year!
>
> I need advice on organizing data flow outside of my 3-node Cassandra
> 0.8.6 cluster. I am configuring my keyspace to use the
> NetworkTopologyStrategy. I have 2 data centers, each with a replication
> factor of 1 (i.e. DC1:1; DC2:1). The configuration of the
> PropertyFileSnitch is:
>
> > ip_node1=DC1:RAC1
> > ip_node2=DC2:RAC1
> > ip_node3=DC1:RAC1
>
> I assign tokens like this:
> node1 = 0
> node2 = 1
> node3 = 85070591730234615865843651857942052864
>
> My write consistency level is ANY.
>
> My data sources are only inserting data into node1 & node3. Essentially
> what happens is that a replica of every input value ends up on node2.
> Node2 thus has a copy of all the data written to the cluster. When node2
> starts getting full, I want to have a script which pulls it off-line and
> does a sequence of operations (compaction/snapshotting/exporting/
> truncating the CFs) in order to back up the data in a remote place and
> to free it up so that it can take more data. When it comes back on-line
> it will take hints from the other 2 nodes.
>
> This is how I plan on shipping data out of my cluster without any
> downtime or any major performance penalty. The problem is that I also
> want to truncate the CFs on node1 & node3 to free them of data. I don't
> know whether I can do this without any downtime or without any serious
> performance penalties. Is anyone using truncate to free up CFs of data?
> How efficient is it?
>
> Any observations or suggestions are much appreciated!
>
> Cheers,
> Alex
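
P.S. For clarity, this is roughly the per-node sequence I had in mind for node2, as a Python sketch that just shells out to nodetool and rsync. The node name, keyspace, data directory, snapshot tag and remote destination are placeholders, and I haven't verified the exact nodetool arguments or directory layout on 0.8.6, so please treat it as pseudocode rather than something tested:

#!/usr/bin/env python
# Rough sketch of the "flush -> snapshot -> copy off -> clear -> truncate"
# sequence described above. Host, keyspace, paths and the remote target are
# placeholders; nodetool options and directory layout may differ by version.
import os
import subprocess

NODE = "node2"                                # node being drained (placeholder)
KEYSPACE = "MyKeyspace"                       # placeholder keyspace name
DATA_DIR = "/var/lib/cassandra/data"          # assumed default data directory
TAG = "offload-2012-01-03"                    # snapshot name
REMOTE = "backup-host:/archive/cassandra/"    # placeholder mass-storage target

def run(cmd):
    print(" ".join(cmd))
    subprocess.check_call(cmd)

# 1. Flush memtables so the snapshot covers everything written so far.
run(["nodetool", "-h", NODE, "flush", KEYSPACE])

# 2. Snapshot: creates hard links to the current SSTables.
run(["nodetool", "-h", NODE, "snapshot", TAG])

# 3. Copy the hard-linked SSTables out to mass storage.
snapshot_dir = os.path.join(DATA_DIR, KEYSPACE, "snapshots", TAG)
run(["rsync", "-a", snapshot_dir, REMOTE])

# 4. Drop the hard links once the copy has been verified.
run(["nodetool", "-h", NODE, "clearsnapshot"])

# 5. Truncate the CFs to free the space (via cassandra-cli or a client
#    library, since truncate is not a nodetool operation) -- left out here.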