Some thoughts on the plan:

* You are monkeying around with things, so do not be surprised when surprising 
things happen. 
* Deliberately unbalancing the cluster may lead to Bad Things happening. 
* In the design discussed it is perfectly reasonable for data not to be on the 
archive node. 
* Truncate is a cluster wide operation and all nodes must be online before it 
will start. 
* Truncate will snapshot before deleting data; you could use this snapshot. 
* A column's TTL applies to that column no matter which node it is on. You 
cannot have a TTL on one replica and no TTL on another. 
* IMHO Cassandra data files (sstables or JSON dumps) are not a good format for 
a historical archive, nothing against Cassandra. You need the lowest common 
format. 
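
To make the TTL point concrete, here is a toy Python model. It is not 
Cassandra code: the Column class and the replicate helper are made up for 
illustration, and the only assumption is that the TTL is written as part of 
the column, so every replica holds the same expiry.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column written with a TTL."""
    name: str
    value: float
    written_at: int  # seconds since the epoch
    ttl: int         # seconds

    def expires_at(self) -> int:
        return self.written_at + self.ttl

    def is_live(self, now: int) -> bool:
        return now < self.expires_at()


def replicate(column: Column, nodes: List[str]) -> Dict[str, Column]:
    """Every replica stores the same column, TTL included."""
    return {node: column for node in nodes}


two_weeks = 14 * 24 * 3600
col = Column("1325376000", 42.0, written_at=1325376000, ttl=two_weeks)
replicas = replicate(col, ["dc1_online", "dc2_archive"])

# The archive copy has the same deadline as the online copy, so any
# backup has to be taken before this time.
backup_deadline = min(c.expires_at() for c in replicas.values())
```

In this model the archive replica expires at the same moment as the online 
one, which is why the TTL puts a hard deadline on the backup.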

If you have the resources for a second cluster, could you put the two together 
and just have one cluster with a very long retention period? One cluster is 
easier than two.  

Assuming there is no business case for this, consider either:

* Dumping the historical data into a Hadoop (with or without HDFS) cluster with 
high compression. If needed you could then run Hive / Pig to fill a companion 
Cassandra cluster with data on demand. Or just query using Hadoop.
* Dumping the historical data to files with high compression and building a 
roll-your-own solution to fill a cluster. 
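
A rough sketch of the second option, assuming you have already read 
(variable id, timestamp, value) samples out of the cluster by some means. The 
function names and the CSV layout are my own invention; gzip-compressed CSV is 
used here only as an example of a format that any tool can read.

```python
import csv
import gzip
from typing import Iterable, Iterator, Tuple

# (variable_id, timestamp in seconds, value): an assumed sample layout.
Sample = Tuple[str, int, float]


def dump_samples(path: str, samples: Iterable[Sample]) -> int:
    """Write samples to a gzip-compressed CSV file and return the row count."""
    count = 0
    with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["variable_id", "timestamp", "value"])
        for variable_id, timestamp, value in samples:
            # repr() keeps enough digits for the float to round-trip.
            writer.writerow([variable_id, timestamp, repr(value)])
            count += 1
    return count


def load_samples(path: str) -> Iterator[Sample]:
    """Read samples back, for example to refill a companion cluster."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            yield row["variable_id"], int(row["timestamp"]), float(row["value"])
```

Whatever reads the archive later only needs gzip and CSV support, not 
Cassandra.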

Also consider talking to DataStax about DSE. 

Cheers 
  
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/01/2012, at 1:41 AM, Alexandru Sicoe wrote:

> Hi,
> 
> On Tue, Jan 3, 2012 at 8:19 PM, aaron morton <[email protected]> wrote:
> Running a time based rolling window of data can be done using the TTL. 
> Backing up the nodes for disaster recovery can be done using snapshots. 
> Restoring any point in time will be tricky because you may restore columns 
> where the TTL has expired. 
>  
> Yeah, that's the thing...if I want to use the system as I explain further 
> below, I cannot do backing up of data (for later restoration) if I'm using 
> TTLs. 
>  
> 
>> Will I get a single copy of the data in the remote storage or will it be 
>> twice the data (data + replica)?
> You will get RF copies of the data. (By the way, there is no original copy.)
> 
> Well, if I organize the cluster as I mentioned in the first email, I will get 
> one copy of each row at a certain point in time on node2 if I take it 
> offline, perform a major compaction and GC, won't I? I don't want to send 
> duplicated data to the mass storage!
>  
> 
> Can you share a bit more about the use case ? How much data and what sort of 
> read patterns ? 
> 
> 
> I have several applications that feed into Cassandra about 2 million 
> different variables (each representing a different monitoring value/channel). 
> The system receives updates for each of these monitoring values at different 
> rates. For each new update, the timestamp and value are recorded in a 
> Cassandra name-value pair. The schema of Cassandra is built using one CF for 
> data and 4 other CFs for metadata (metadata CFs are static - don't grow 
> almost at all once they've been loaded). The data CF uses a row for each 
> variable. Each row acts as a 4 hour time bin. I achieve this by creating the 
> row key as a concatenation of  the first 6 digits of the timestamp at which 
> the data is inserted + the unique ID of the variable. After the time bin 
> expires, a new row will be created for the same variable ID.
> 
> The system can currently sustain the insertion load. Now I'm looking into 
> organizing the flow of data out of the cluster and retrieval performance for 
> random queries:
> 
> Why do I need to organize the data out? Well, my requirement is to keep all 
> the data coming into the system at the highest granularity for long term 
> (several years). The 3 node cluster I mentioned is the online cluster which 
> is supposed to be able to absorb the input load for a relatively short period 
> of time, a few weeks (I am constrained to do this). After this period the 
> data has to be shipped out of the cluster in a mass storage facility and the 
> cluster needs to be emptied to make room for more data. Also, the online 
> cluster will serve reads while it takes in data. For older data I am planning 
> to have another cluster that gets loaded with data from the storage facility 
> on demand and will serve reads from there.
> 
> Why random queries? There is no specific use case about them, that's why I 
> want to rely only on the built in Cassandra indexes for now.  Generally the 
> client will ask for sets of values within a time range up to 8-10 hours in 
> the past. Apart from some sets of variables that will be almost always asked 
> together, any combination is possible because this system will feed in a web 
> dashboard which will be used for debugging purposes  - to correlate and 
> aggregate streams of variables. Depending on the problem, different variable 
> combinations could be investigated. 
>  
> Can you split the data stream into a permanent log record and also into 
> Cassandra for a rolling window of queryable data?
> 
> In the end, essentially that's what I've been meaning to do with organizing 
> the cluster in a 2 DC setup: I wanted to have 2 nodes in DC1 taking the data 
> and reads (the rolling window) and replicating to the node in DC2 (the 
> permanent log - of a single copy of the data). I was thinking of implementing 
> the rolling window by emptying the nodes in DC1 using truncate instead of 
> what you propose now with the rolling window using TTL. 
> 
> Ok, so I can do what you are saying easily if Cassandra allows me to have a 
> TTL only on the first copy of the data and have the second replica without a 
> TTL. Is this possible? I think it would solve my problem, as long as I can 
> backup and empty the node in DC2 before the TTLs expire in the other 2 nodes.
> 
> Cheers,
> Alex
> 
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 3/01/2012, at 11:41 PM, Alexandru Sicoe wrote:
> 
>> Hi,
>> 
>> I need to build a system that stores data for years, so yes, I am backing up 
>> data in another mass storage system from where it could  be later accessed. 
>> The data that I successfully back up has to be deleted from my cluster to 
>> make space for new data coming in.
>> 
>> I was aware about the snapshotting which I will use for getting the data out 
>> of node2: it creates hard links to the SSTables of a CF and then I can copy 
>> over those files pointed to by the hard links into another location. After 
>> that I get rid of the snapshot (hard links) and then I can truncate my CFs. 
>> It's clear that snapshotting will give me a single copy of the data in case 
>> I have a unique copy of the data on one node. It's not clear to me what 
>> happens if I have let's say a cluster with 3 nodes and RF=2 and I do a 
>> snapshot of every node and copy those snapshots to remote storage. Will I 
>> get a single copy of the data in the remote storage or will it be twice the 
>> data (data + replica)?
>> 
>> I've started reading about TTL and I think I can use it but it's not clear 
>> to me how it would work in conjunction with the snapshotting/backing up I 
>> need to do. I mean, it will impose a deadline by which I need to perform a 
>> backup in order not to miss any data. Also, I might duplicate the data if 
>> some columns don't expire fully between 2 backups. Any clarifications on 
>> this?
>> 
>> Cheers,
>> Alex
>> 
>> On Tue, Jan 3, 2012 at 9:44 AM, aaron morton <[email protected]> wrote:
>> That sounds a little complicated. 
>> 
>> Do you want to get the data out for an off node backup or is it for 
>> processing in another system ? 
>> 
>> You may get by using:
>> 
>> * TTL to expire data via compaction
>> * snapshots for backups
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 3/01/2012, at 11:00 AM, Alexandru Sicoe wrote:
>> 
>>> Hi everyone and Happy New Year!
>>> 
>>> I need advice for organizing data flow outside of my 3 node Cassandra 0.8.6 
>>> cluster. I am configuring my keyspace to use the NetworkTopologyStrategy. I 
>>> have 2 data centers each with a replication factor 1 (i.e. DC1:1; DC2:1) 
>>> the configuration of the PropertyFileSnitch is:
>>> 
>>>     ip_node1=DC1:RAC1
>>>     ip_node2=DC2:RAC1
>>>     ip_node3=DC1:RAC1
>>> 
>>> I assign tokens like this:
>>> 
>>>     node1 = 0
>>>     node2 = 1
>>>     node3 = 85070591730234615865843651857942052864
>>> 
>>> My write consistency level is ANY.
>>> 
>>> My data sources are only inserting data in node1 & node3. Essentially what 
>>> happens is that a replica of every input value will end up on node2. Node 2 
>>> thus has a copy of the entire data written to the cluster. When Node2 
>>> starts getting full, I want to have a script which pulls it off-line and 
>>> does a sequence of operations (compaction/snapshotting/exporting/truncating 
>>> the CFs) in order to back up the data in a remote place and to free it up 
>>> so that it can take more data. When it comes back on-line it will take 
>>> hints from the other 2 nodes.
>>> 
>>> This is how I plan on shipping data out of my cluster without any downtime 
>>> or any major performance penalty. The problem is when I want to also 
>>> truncate the CFs in node1 & node3 to also free them up of data. I don't 
>>> know whether I can do this without any downtime or without any serious 
>>> performance penalties. Is anyone using truncate to free up CFs of data? How 
>>> efficient is this?
>>> 
>>> Any observations or suggestions are much appreciated!
>>> 
>>> Cheers,
>>> Alex
>> 
>> 
> 
> 
