Perhaps other, more experienced and reputable contributors to this list can comment, but to be frank: Cassandra is probably not for you (at least for now). I personally feel Cassandra is one of the stronger NoSQL options out there and has the potential to become the de facto standard, but it's not quite there yet and does not inherently meet your requirements.

To give you some background, I started experimenting with Cassandra as a personal project to avoid having to look through days' worth of server logs (and because I thought it was cool). The project ballooned and has become my organization's primary metrics and analytics platform, which currently processes 200 million+ events/records per day. I doubt any traditional database solution could have performed as well as Cassandra, but the development and operations process has not been without severe growing pains.

> 1. Storing several million data records per day (each record will be a
> few KB in size) without any data loss.

Absolutely, no problems on this front. A cluster of moderately beefy servers will handle this with no complaints. As long as you are careful to avoid hotspots in your data distribution, Cassandra truly is damn near linearly scalable with hardware.

> 2. Aggregation of certain fields in the stored records, like Avg
> across time period.

Cassandra cannot do this on its own (by design and for good reason). There have been efforts to add support for higher-level data processing languages (such as Pig and Hive), but they are not 'out of the box' solutions and, in my experience, are difficult to get working properly. I ended up writing my own data processing/report generation framework that works ridiculously well for my particular case. In relation to your requirements, calculating averages across fields would probably have to be implemented manually (and executed as a periodic, automated task). Although non-trivial, this isn't quite as bad as you might think.

> 3. Using certain existing fields to calculate new values on the fly
> and store it too.

Not quite sure what you are asking here. To go back to the last point: to calculate anything new, you are probably going to have to load all the records on which that calculation depends into a separate process/server.
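
A minimal sketch of what such a manual, periodic job might look like once the records have been loaded into a separate process. This is an illustration under assumptions, not the framework described above: the record layout (dicts with `timestamp` and `duration` fields), the derived `end_time` field and both function names are hypothetical, and fetching the records from Cassandra is left out entirely.

```python
from collections import defaultdict


def derive_fields(rec):
    """Return a copy of rec with a derived 'end_time' field.

    'timestamp' (epoch seconds) and 'duration' (seconds) are
    hypothetical field names chosen for this sketch.
    """
    out = dict(rec)
    out["end_time"] = rec["timestamp"] + rec["duration"]
    return out


def average_by_period(records, period_seconds=3600):
    """Average 'duration' per fixed-width time period.

    Returns a dict mapping each period's start time to the mean
    duration of the records whose timestamp falls in that period.
    """
    sums = defaultdict(lambda: [0, 0.0])  # period start -> [count, total]
    for rec in records:
        start = int(rec["timestamp"]) // period_seconds * period_seconds
        cell = sums[start]
        cell[0] += 1
        cell[1] += rec["duration"]
    return {start: total / count
            for start, (count, total) in sorted(sums.items())}
```

A scheduled task would run something like this over each new batch of records and write the derived fields and per-period averages back as ordinary rows.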

Generally, I would say Cassandra isn't particularly good at 'on the fly' data aggregation tasks (certainly not to the extent an SQL database is). To be fair, that's also not what it is explicitly designed for or advertised to do well.

> 4. We were wondering if pre-aggregation was a good choice (calculating
> aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
> need ad-hoc aggregation, does Cassandra support that over this amount
> of data?

Cassandra is GREAT for accessing/storing/retrieving/post-processing anything that can be pre-computed. If you have been doing any amount of reading, you will likely have heard that in SQL you model data, while in Cassandra (and most other NoSQL databases) you model your queries (sorry for ripping off whoever said this originally). If there is one thing I can say I have learned about Cassandra, it is this: pre-compute (or asynchronously compute) anything you possibly can, and don't be afraid to write a ridiculous amount to the Cassandra database.

In terms of ad-hoc aggregation, there is no nice simple scripting language for Cassandra data processing (nothing like SQL). That said, you can do most things pretty quickly with a bit of code. Consider that loading a few hundred to a few thousand records (< 3k) can be pretty quick (< 100 ms, often < 10 ms, particularly if they are cached). Our organization basically uses the following approach: 'use Cassandra for generating continuous 10 second accuracy time series reports but MySQL and a production DB replica for any ad-hoc single value report the boss wants NOW'.

Based on what you have described, it sounds like you are thinking about your problem from a SQL-like point of view: store data once, then query/filter/aggregate it in multiple different ways to obtain useful information. If possible, try to leverage the power of Cassandra and store it in efficient, per-query pre-optimized forms.
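
The pre-aggregation idea from question 4 can be sketched roughly as follows. This is only an illustration under assumptions: an in-memory dict stands in for the store, and in a real deployment each bucket update would instead be a write to a column family keyed by window size and bucket start. The `PreAggregator` class and its method names are hypothetical.

```python
from collections import defaultdict

WINDOWS = (60, 300, 900)  # 1, 5 and 15 minutes, in seconds


class PreAggregator:
    """Keep a running count and sum per (window, bucket start)."""

    def __init__(self):
        # (window, bucket_start) -> [count, total]
        self.buckets = defaultdict(lambda: [0, 0])

    def record(self, timestamp, value):
        """Fold one value into every window's bucket at write time."""
        for window in WINDOWS:
            start = timestamp // window * window
            cell = self.buckets[(window, start)]
            cell[0] += 1
            cell[1] += value

    def average(self, window, timestamp):
        """Read the pre-computed average for the bucket holding timestamp."""
        key = (window, timestamp // window * window)
        if key not in self.buckets:
            return None  # nothing was recorded in this bucket
        count, total = self.buckets[key]
        return total / count
```

Each incoming value costs one update per window, and reading an average for any of the pre-chosen windows becomes a single lookup rather than a scan over raw records. Anything outside those windows still needs the load-and-compute approach.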

For example, I can imagine the average call duration being an important parameter in a system analyzing call data records. Instead of storing all the information about a call in one place, store the 'call duration' in a separate column family, with each row holding the call durations for a given hour and each column holding a single integer duration (the column name being a TimeUUID). My metrics system does something similar to this and loads batches of 15,000 records (column slice) in < 200 ms. By parallelizing across 10 threads loading from different rows, I can process the average, standard deviation and a factor roughly meaning 'how close to Gaussian' for 1 million records in < 5 seconds.

To reiterate, Cassandra is not the solution if you are looking for 'Database: I command thee to give me the average of field x.' That said, I have found its overall data-processing capabilities to be reasonably impressive.

Dan

-----Original Message-----
From: Arun Cherian [mailto:archer...@gmail.com]
Sent: December-10-10 16:43
To: user@cassandra.apache.org
Subject: Cassandra for Ad-hoc Aggregation and formula calculation

Hi,

I have been reading up on Cassandra for the past few weeks and I am highly impressed by the features it offers. At work, we are starting work on a product that will handle several million CDR (Call Data Record, basically can be thought of as a .CSV file) per day. We will have to store the data, and perform aggregations and calculations on them. A few veteran RDBMS admin friends (we are a small .NET shop, we don't have any in-house DB talent) recommended Infobright and noSQL to us, and hence my search. I was wondering if Cassandra is a good fit for

1. Storing several million data records per day (each record will be a few KB in size) without any data loss.
2. Aggregation of certain fields in the stored records, like Avg across time period.
3. Using certain existing fields to calculate new values on the fly and store it too.
4. We were wondering if pre-aggregation was a good choice (calculating aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we need ad-hoc aggregation, does Cassandra support that over this amount of data?

Thanks,
Arun