Some thoughts on the plan:

* You are monkeying around with things; do not be surprised when surprising things happen.
* Deliberately unbalancing the cluster may lead to Bad Things happening.
* In the design discussed, it is perfectly reasonable for data not to be on the archive node.
* Truncate is a cluster-wide operation, and all nodes must be online before it will start.
* Truncate will snapshot before deleting data; you could use this snapshot.
* TTL for a column applies to that column no matter which node it is on.
* IMHO Cassandra data files (SSTables or JSON dumps) are not a good format for a historical archive (nothing against Cassandra). You need the lowest common format.
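On the lowest-common-format point, something as plain as gzipped CSV of (row key, column name, timestamp, value) tuples can be read back by almost anything years later. A rough Python sketch; the exact field layout is illustrative, not prescribed by this thread:

```python
import csv
import gzip

def dump_rows(rows, path):
    """Write (row_key, column, timestamp, value) tuples to a gzipped CSV.

    Hypothetical archive format: any future system that reads CSV can
    reload it, unlike raw SSTables tied to a Cassandra version.
    """
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["row_key", "column", "timestamp", "value"])
        for row in rows:
            writer.writerow(row)

def load_rows(path):
    """Read the archive back as a list of string tuples."""
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [tuple(row) for row in reader]
```

Gzip is only one choice; anything with wide tool support (bzip2, plain zip) keeps the archive readable without Cassandra in the loop.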
If you have the resources for a second cluster, could you put the two together and just have one cluster with a very large retention policy? One cluster is easier than two. Assuming there is no business case for this, consider either:

* Dumping the historical data into a Hadoop (with or without HDFS) cluster with high compression. If needed, you could then run Hive / Pig to fill a companion Cassandra cluster with data on demand. Or just query using Hadoop.
* Dumping the historical data to files with high compression and rolling your own solution to fill a cluster.

Also consider talking to DataStax about DSE.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/01/2012, at 1:41 AM, Alexandru Sicoe wrote:

> Hi,
>
> On Tue, Jan 3, 2012 at 8:19 PM, aaron morton <[email protected]> wrote:
> Running a time-based rolling window of data can be done using the TTL.
> Backing up the nodes for disaster recovery can be done using snapshots.
> Restoring any point in time will be tricky because you may restore columns
> where the TTL has expired.
>
> Yeah, that's the thing: if I want to use the system as I explain further
> below, I cannot back up data (for later restoration) if I'm using TTLs.
>
>> Will I get a single copy of the data in the remote storage or will it be
>> twice the data (data + replica)?
> You will get RF copies of the data. (By the way, there is no original copy.)
>
> Well, if I organize the cluster as I mentioned in the first email, I will get
> one copy of each row at a certain point in time on node2 if I take it
> offline, perform a major compaction and GC, won't I? I don't want to send
> duplicated data to the mass storage!
>
> Can you share a bit more about the use case? How much data and what sort of
> read patterns?
>
> I have several applications that feed into Cassandra about 2 million
> different variables (each representing a different monitoring value/channel).
> The system receives updates for each of these monitoring values at different
> rates. For each new update, the timestamp and value are recorded in a
> Cassandra name-value pair. The Cassandra schema is built using one CF for
> data and 4 other CFs for metadata (the metadata CFs are static; they barely
> grow at all once they've been loaded). The data CF uses a row for each
> variable, and each row acts as a 4-hour time bin. I achieve this by creating
> the row key as a concatenation of the first 6 digits of the timestamp at
> which the data is inserted + the unique ID of the variable. After the time
> bin expires, a new row will be created for the same variable ID.
>
> The system can currently sustain the insertion load. Now I'm looking into
> organizing the flow of data out of the cluster and retrieval performance for
> random queries.
>
> Why do I need to move the data out? Well, my requirement is to keep all the
> data coming into the system at the highest granularity for the long term
> (several years). The 3-node cluster I mentioned is the online cluster, which
> is supposed to be able to absorb the input load for a relatively short
> period of time, a few weeks (I am constrained to do this). After this period
> the data has to be shipped out of the cluster to a mass storage facility,
> and the cluster needs to be emptied to make room for more data. Also, the
> online cluster will serve reads while it takes in data. For older data I am
> planning to have another cluster that gets loaded with data from the storage
> facility on demand and will serve reads from there.
>
> Why random queries? There is no specific use case for them; that's why I
> want to rely only on the built-in Cassandra indexes for now. Generally the
> client will ask for sets of values within a time range up to 8-10 hours in
> the past.
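The time-bin row key described above can be sketched roughly as follows. The separator is illustrative, and note that 6 leading digits of an epoch-seconds timestamp would actually give 10,000-second (roughly 2.8-hour) bins, so the quoted 4-hour figure presumably comes from a slightly different timestamp format:

```python
def row_key(epoch_seconds, variable_id):
    """Build a row key from the first 6 digits of the insertion
    timestamp plus the variable's unique ID (separator is hypothetical).

    All samples whose timestamps share the same 6 leading digits land
    in the same row, so each row is one fixed-width time bin.
    """
    time_bin = str(epoch_seconds)[:6]
    return "%s:%s" % (time_bin, variable_id)
```

For example, two inserts one second apart map to the same row, so the row accumulates every update for that variable until the bin rolls over.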
> Apart from some sets of variables that will almost always be asked for
> together, any combination is possible, because this system will feed a web
> dashboard used for debugging purposes: to correlate and aggregate streams
> of variables. Depending on the problem, different variable combinations
> could be investigated.
>
> Can you split the data stream into a permanent log record and also into
> Cassandra for a rolling window of queryable data?
>
> In the end, that's essentially what I've been meaning to do by organizing
> the cluster in a 2-DC setup: I wanted to have 2 nodes in DC1 taking the data
> and reads (the rolling window) and replicating to the node in DC2 (the
> permanent log, a single copy of the data). I was thinking of implementing
> the rolling window by emptying the nodes in DC1 using truncate, instead of
> what you propose now with the rolling window using TTL.
>
> OK, so I can easily do what you are saying if Cassandra allows me to have a
> TTL only on the first copy of the data and have the second replica without
> a TTL. Is this possible? I think it would solve my problem, as long as I can
> back up and empty the node in DC2 before the TTLs expire on the other 2
> nodes.
>
> Cheers,
> Alex
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 3/01/2012, at 11:41 PM, Alexandru Sicoe wrote:
>
>> Hi,
>>
>> I need to build a system that stores data for years, so yes, I am backing
>> up data in another mass storage system from where it can later be
>> accessed. The data that I successfully back up has to be deleted from my
>> cluster to make space for new data coming in.
>>
>> I was aware of snapshotting, which I will use for getting the data out of
>> node2: it creates hard links to the SSTables of a CF, and then I can copy
>> the files pointed to by the hard links to another location.
>> After that I get rid of the snapshot (the hard links) and then I can
>> truncate my CFs. It's clear that snapshotting will give me a single copy
>> of the data if I have a unique copy of the data on one node. It's not
>> clear to me what happens if I have, let's say, a cluster with 3 nodes and
>> RF=2 and I do a snapshot of every node and copy those snapshots to remote
>> storage. Will I get a single copy of the data in the remote storage or
>> will it be twice the data (data + replica)?
>>
>> I've started reading about TTL and I think I can use it, but it's not
>> clear to me how it would work in conjunction with the snapshotting/backing
>> up I need to do. I mean, it will impose a deadline by which I need to
>> perform a backup in order not to miss any data. Also, I might duplicate
>> data if some columns don't expire fully between 2 backups. Any
>> clarifications on this?
>>
>> Cheers,
>> Alex
>>
>> On Tue, Jan 3, 2012 at 9:44 AM, aaron morton <[email protected]> wrote:
>> That sounds a little complicated.
>>
>> Do you want to get the data out for an off-node backup, or is it for
>> processing in another system?
>>
>> You may get by using:
>>
>> * TTL to expire data via compaction
>> * snapshots for backups
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 3/01/2012, at 11:00 AM, Alexandru Sicoe wrote:
>>
>>> Hi everyone, and Happy New Year!
>>>
>>> I need advice on organizing data flow out of my 3-node Cassandra 0.8.6
>>> cluster. I am configuring my keyspace to use the NetworkTopologyStrategy.
>>> I have 2 data centers, each with a replication factor of 1 (i.e. DC1:1;
>>> DC2:1). The configuration of the PropertyFileSnitch is:
>>>
>>> ip_node1=DC1:RAC1
>>> ip_node2=DC2:RAC1
>>> ip_node3=DC1:RAC1
>>>
>>> I assign tokens like this:
>>>
>>> node1 = 0
>>> node2 = 1
>>> node3 = 85070591730234615865843651857942052864
>>>
>>> My write consistency level is ANY.
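As a quick sanity check on those tokens: node3 sits exactly at the midpoint of the RandomPartitioner ring (assuming the standard [0, 2^127) range), which balances DC1's two nodes, while node2's token of 1 only needs to differ from node1's, since as the sole DC2 node it holds a replica of everything regardless of its position:

```python
# RandomPartitioner tokens live in [0, 2**127). Placing DC1's two
# nodes (node1, node3) at 0 and the midpoint splits that DC's key
# range evenly between them.
RING_SIZE = 2 ** 127
node3_token = RING_SIZE // 2
print(node3_token)
```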
>>> My data sources are only inserting data into node1 & node3. Essentially
>>> what happens is that a replica of every input value ends up on node2.
>>> Node2 thus has a copy of the entire data written to the cluster. When
>>> node2 starts getting full, I want to have a script which pulls it offline
>>> and does a sequence of operations (compaction / snapshotting / exporting /
>>> truncating the CFs) in order to back up the data in a remote place and to
>>> free node2 up so that it can take more data. When it comes back online it
>>> will take hints from the other 2 nodes.
>>>
>>> This is how I plan on shipping data out of my cluster without any
>>> downtime or any major performance penalty. The problem is when I want to
>>> also truncate the CFs on node1 & node3 to free them of data as well. I
>>> don't know whether I can do this without any downtime or serious
>>> performance penalties. Is anyone using truncate to free up CFs of data?
>>> How efficient is this?
>>>
>>> Any observations or suggestions are much appreciated!
>>>
>>> Cheers,
>>> Alex
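The TTL-versus-backup timing concern raised earlier in the thread comes down to simple arithmetic: a column written just after one backup expires one TTL later, so the next backup must finish before that. A hypothetical helper (the names and the safety-margin idea are mine, not from the thread):

```python
def next_backup_window(last_backup_s, ttl_s, backup_duration_s):
    """Return (latest start, hard deadline) for the next backup.

    A column written just after last_backup_s disappears at
    last_backup_s + ttl_s; the snapshot must complete before then,
    so it must start at least backup_duration_s earlier. To avoid
    duplicates between backups, the restore side can discard columns
    with insertion timestamps at or before last_backup_s, since the
    previous backup already holds them.
    """
    deadline = last_backup_s + ttl_s
    latest_start = deadline - backup_duration_s
    return latest_start, deadline
```

In other words, with a 14-day TTL and a one-hour backup run, each backup must start no later than 13 days 23 hours after the previous one.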
