Hi,

On Wed, Jan 4, 2012 at 9:54 PM, aaron morton <aa...@thelastpickle.com> wrote:
> Some thoughts on the plan:
>
> * You are monkeying around with things, do not be surprised when
> surprising things happen.

I am just trying to explore different solutions to my problem.

> * Deliberately unbalancing the cluster may lead to Bad Things happening.

I will take your advice on this. I would have liked an extra node so that I
could have 2 nodes in each DC.

> * In the design discussed it is perfectly reasonable for data not to be on
> the archive node.

You mean when having the 2 DC setup I mentioned and using TTL? If I have the
2 DC setup but don't use TTL, I don't understand why data wouldn't be on the
archive node.

> * Truncate is a cluster wide operation and all nodes must be online before
> it will start.
> * Truncate will snapshot before deleting data, you could use this snapshot.
> * TTL for a column is for a column no matter which node it is on.

Thanks for clarifying these!

> * IMHO Cassandra data files (sstables or JSON dumps) are not a good format
> for a historical archive, nothing against Cassandra. You need the lowest
> common format.

So what data format should I use for historical archiving?

> If you have the resources for a second cluster could you put the two
> together and just have one cluster with a very large retention policy? One
> cluster is easier than two.

I am constrained to keep limited retention on the Cassandra cluster that is
collecting the data. Once I archive the data for long-term storage, I cannot
bring it back into the same Cassandra cluster that collected it in the first
place, because that cluster is in an enclosed network with strict rules. I
have to load it into another cluster outside the enclosed network. It's not
that I have the resources for a second cluster; I am forced to use a second
cluster.

> Assuming there is no business case for this, consider either:
>
> * Dumping the historical data into a Hadoop (with or without HDFS) cluster
> with high compression.
> If needed you could then run Hive / Pig to fill a
> companion Cassandra cluster with data on demand. Or just query using Hadoop.
> * Dumping the historical data to files with high compression and a roll
> your own solution to fill a cluster.

Ok, thanks for these suggestions, I will have to investigate further.
Also considering talking to DataStax about DSE.

Cheers,
Alex

> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 5/01/2012, at 1:41 AM, Alexandru Sicoe wrote:
>
> Hi,
>
> On Tue, Jan 3, 2012 at 8:19 PM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> Running a time based rolling window of data can be done using the TTL.
>> Backing up the nodes for disaster recovery can be done using snapshots.
>> Restoring any point in time will be tricky because you may restore columns
>> where the TTL has expired.
>
> Yeah, that's the thing: if I want to use the system as I explain further
> below, I cannot back up data (for later restoration) if I'm using TTLs.
>
>> Will I get a single copy of the data in the remote storage or will it be
>> twice the data (data + replica)?
>>
>> You will get RF copies of the data. (By the way, there is no original copy.)
>
> Well, if I organize the cluster as I mentioned in the first email, I will
> get one copy of each row at a certain point in time on node2 if I take it
> offline and perform a major compaction and GC, won't I? I don't want to
> send duplicated data to the mass storage!
>
>> Can you share a bit more about the use case? How much data and what sort
>> of read patterns?
>
> I have several applications that feed into Cassandra about 2 million
> different variables (each representing a different monitoring
> value/channel). The system receives updates for each of these monitoring
> values at different rates. For each new update, the timestamp and value
> are recorded in a Cassandra name-value pair.
> The schema of Cassandra is built
> using one CF for data and 4 other CFs for metadata (the metadata CFs are
> static - they barely grow at all once they've been loaded). The data CF
> uses a row for each variable. Each row acts as a 4 hour time bin. I achieve
> this by creating the row key as a concatenation of the first 6 digits of
> the timestamp at which the data is inserted + the unique ID of the
> variable. After the time bin expires, a new row will be created for the
> same variable ID.
>
> The system can currently sustain the insertion load. Now I'm looking into
> organizing the flow of data out of the cluster and retrieval performance
> for random queries.
>
> Why do I need to organize the data out? Well, my requirement is to keep
> all the data coming into the system at the highest granularity for the
> long term (several years). The 3 node cluster I mentioned is the online
> cluster which is supposed to be able to absorb the input load for a
> relatively short period of time, a few weeks (I am constrained to do
> this). After this period the data has to be shipped out of the cluster
> into a mass storage facility and the cluster needs to be emptied to make
> room for more data. Also, the online cluster will serve reads while it
> takes in data. For older data I am planning to have another cluster that
> gets loaded with data from the storage facility on demand and will serve
> reads from there.
>
> Why random queries? There is no specific use case for them, which is why I
> want to rely only on the built-in Cassandra indexes for now. Generally the
> client will ask for sets of values within a time range up to 8-10 hours in
> the past. Apart from some sets of variables that will almost always be
> asked for together, any combination is possible, because this system will
> feed a web dashboard which will be used for debugging purposes - to
> correlate and aggregate streams of variables. Depending on the problem,
> different variable combinations could be investigated.
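The time-binning scheme described above can be sketched in a few lines of Python. This is an illustrative sketch only: the exact key layout (including the separator) is an assumption, and note that with 10-digit epoch-second timestamps, fixing the first 6 digits gives bins of 10^4 seconds, i.e. roughly 2.8 hours rather than exactly 4.

```python
import time

def row_key(var_id, ts=None):
    """Bin rows by time: the first 6 digits of a 10-digit epoch-second
    timestamp stay constant for 10**4 seconds (~2.8 hours), so all updates
    for a variable within that window share one row.  The ':' separator
    and the key layout are assumptions for illustration."""
    ts = int(ts if ts is not None else time.time())
    return "%s:%s" % (str(ts)[:6], var_id)

# Two updates 100 s apart land in the same row; later a new row starts.
print(row_key("var42", 1325590000))   # -> 132559:var42
print(row_key("var42", 1325590100))   # -> 132559:var42
print(row_key("var42", 1325600000))   # -> 132560:var42
```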
>> Can you split the data stream into a permanent log record and also
>> into Cassandra for a rolling window of queryable data?
>
> In the end, essentially that's what I've been meaning to do with
> organizing the cluster in a 2 DC setup: I wanted to have 2 nodes in DC1
> taking the data and reads (the rolling window) and replicating to the node
> in DC2 (the permanent log - a single copy of the data). I was thinking of
> implementing the rolling window by emptying the nodes in DC1 using
> truncate, instead of what you propose now, a rolling window using TTL.
>
> Ok, so I can do what you are saying easily if Cassandra allows me to have
> a TTL only on the first copy of the data and have the second replica
> without a TTL. Is this possible? I think it would solve my problem, as
> long as I can back up and empty the node in DC2 before the TTLs expire on
> the other 2 nodes.
>
> Cheers,
> Alex
>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 3/01/2012, at 11:41 PM, Alexandru Sicoe wrote:
>>
>> Hi,
>>
>> I need to build a system that stores data for years, so yes, I am backing
>> up data in another mass storage system from where it can be accessed
>> later. The data that I successfully back up has to be deleted from my
>> cluster to make space for new data coming in.
>>
>> I was aware of snapshotting, which I will use for getting the data out of
>> node2: it creates hard links to the SSTables of a CF, and then I can copy
>> the files pointed to by the hard links to another location. After that I
>> get rid of the snapshot (hard links) and then I can truncate my CFs. It's
>> clear that snapshotting will give me a single copy of the data in case I
>> have a unique copy of the data on one node.
>> It's not clear to me
>> what happens if I have, let's say, a cluster with 3 nodes and RF=2 and I
>> do a snapshot of every node and copy those snapshots to remote storage.
>> Will I get a single copy of the data in the remote storage or will it be
>> twice the data (data + replica)?
>>
>> I've started reading about TTL and I think I can use it, but it's not
>> clear to me how it would work in conjunction with the snapshotting/backing
>> up I need to do. I mean, it will impose a deadline by which I need to
>> perform a backup in order not to miss any data. Also, I might duplicate
>> data if some columns don't expire fully between 2 backups. Any
>> clarifications on this?
>>
>> Cheers,
>> Alex
>>
>> On Tue, Jan 3, 2012 at 9:44 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>
>>> That sounds a little complicated.
>>>
>>> Do you want to get the data out for an off-node backup or is it for
>>> processing in another system?
>>>
>>> You may get by using:
>>>
>>> * TTL to expire data via compaction
>>> * snapshots for backups
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 3/01/2012, at 11:00 AM, Alexandru Sicoe wrote:
>>>
>>> Hi everyone and Happy New Year!
>>>
>>> I need advice for organizing data flow out of my 3 node Cassandra 0.8.6
>>> cluster. I am configuring my keyspace to use the NetworkTopologyStrategy.
>>> I have 2 data centers, each with a replication factor of 1 (i.e. DC1:1;
>>> DC2:1). The configuration of the PropertyFileSnitch is:
>>>
>>> ip_node1=DC1:RAC1
>>> ip_node2=DC2:RAC1
>>> ip_node3=DC1:RAC1
>>>
>>> I assign tokens like this:
>>>
>>> node1 = 0
>>> node2 = 1
>>> node3 = 85070591730234615865843651857942052864
>>>
>>> My write consistency level is ANY.
>>>
>>> My data sources are only inserting data into node1 & node3. Essentially
>>> what happens is that a replica of every input value will end up on node2.
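The claim that node2 receives a replica of everything follows from NetworkTopologyStrategy picking one replica per DC here: node2 is the only DC2 node, so every key's DC2 replica lands on it. A simplified sketch of that placement logic (rack awareness omitted; this is not the actual Cassandra implementation, just an illustration of the ring walk):

```python
from bisect import bisect_right

# (token, node, dc) as configured in the PropertyFileSnitch example above
ring = sorted([
    (0, "node1", "DC1"),
    (1, "node2", "DC2"),
    (85070591730234615865843651857942052864, "node3", "DC1"),
])
tokens = [t for t, _, _ in ring]

def replicas(key_token):
    """Simplified NetworkTopologyStrategy with RF=1 per DC: walk the ring
    clockwise from the key's token and take the first node seen in each
    DC.  Since node2 is the only DC2 node, it is always the DC2 replica."""
    out = {}
    start = bisect_right(tokens, key_token)
    for i in range(len(ring)):
        _, node, dc = ring[(start + i) % len(ring)]
        out.setdefault(dc, node)
    return out

print(replicas(10))  # -> {'DC1': 'node3', 'DC2': 'node2'}
```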
>>> Node2 thus has a copy of the entire data written to the cluster. When
>>> node2 starts getting full, I want to have a script which pulls it
>>> offline and does a sequence of operations
>>> (compaction/snapshotting/exporting/truncating the CFs) in order to back
>>> up the data in a remote place and to free node2 up so that it can take
>>> more data. When it comes back online it will take hints from the other
>>> 2 nodes.
>>>
>>> This is how I plan on shipping data out of my cluster without any
>>> downtime or any major performance penalty. The problem is when I want to
>>> also truncate the CFs on node1 & node3 to free them of data as well. I
>>> don't know whether I can do this without any downtime or without any
>>> serious performance penalties. Is anyone using truncate to free CFs of
>>> data? How efficient is this?
>>>
>>> Any observations or suggestions are much appreciated!
>>>
>>> Cheers,
>>> Alex
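On the deadline that TTL imposes on backups, raised earlier in the thread, the arithmetic is simple: snapshots must be taken more often than the TTL, and columns written between two snapshots appear in both, so the restore step must de-duplicate. A sketch of that schedule calculation (illustrative only; the safety margin is an assumption):

```python
def backup_schedule(ttl_seconds, safety_margin):
    """With a column TTL of ttl_seconds, taking a full snapshot at least
    every (ttl - margin) seconds guarantees every column appears in some
    backup before it expires.  Columns younger than the interval show up
    in two consecutive snapshots, so a restore has to de-duplicate on
    (row key, column name)."""
    interval = ttl_seconds - safety_margin
    if interval <= 0:
        raise ValueError("safety margin must be smaller than the TTL")
    return interval

# e.g. a 14-day TTL with a 1-day operational margin -> snapshot every 13 days
print(backup_schedule(14 * 86400, 86400) // 86400)  # -> 13
```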