Nice email Dan. I would also add that if you are still in the initial stages, take a look at Hadoop+Pig. If your source data is write-once, read-many it may be a better fit, but then you would also need to calculate the aggregates and store them somewhere.
So Cassandra *may* be just what you want. The ability to keep large amounts of data online with high performance, and to remove other servers from your stack, is a definite plus.

Aaron

On 11/12/2010, at 7:01 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:

> Perhaps other, more experienced and reputable contributors to this list can comment, but to be frank: Cassandra is probably not for you (at least for now). I personally feel Cassandra is one of the stronger NoSQL options out there and has the potential to become the de facto standard, but it's not quite there yet and does not inherently meet your requirements.
>
> To give you some background, I started experimenting with Cassandra as a personal project to avoid having to look through days' worth of server logs (and because I thought it was cool). The project ballooned and has become my organization's primary metrics and analytics platform, which currently processes 200 million+ events/records per day. I doubt any traditional database solution could have performed as well as Cassandra, but the development and operations process has not been without severe growing pains.
>
>> 1. Storing several million data records per day (each record will be a few KB in size) without any data loss.
>
> Absolutely, no problems on this front. A cluster of moderately beefy servers will handle this with no complaints. As long as you are careful to avoid hotspots in your data distribution, Cassandra truly is damn near linearly scalable with hardware.
>
>> 2. Aggregation of certain fields in the stored records, like Avg across time period.
>
> Cassandra cannot do this on its own (by design and for good reason). There have been efforts to add support for higher-level data processing languages (such as Pig and Hive) but they are not 'out of the box' solutions and, in my experience, difficult to get working properly. I ended up writing my own data processing/report generation framework that works ridiculously well for my particular case. In relation to your requirements, calculating averages across fields would probably have to be implemented manually (and executed as a periodic, automated task). Although non-trivial, this isn't quite as bad as you might think.
>
>> 3. Using certain existing fields to calculate new values on the fly and store it too.
>
> Not quite sure what you are asking here. To go back to the last point: to calculate anything new, you are probably going to have to load all the records on which that calculation depends into a separate process/server. Generally, I would say Cassandra isn't particularly good at 'on the fly' data aggregation tasks (certainly not at all to the extent an SQL database is). To be fair, that's also not what it is explicitly designed for or advertised to do well.
>
>> 4. We were wondering if pre-aggregation was a good choice (calculating aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we need ad-hoc aggregation, does Cassandra support that over this amount of data?
>
> Cassandra is GREAT for accessing/storing/retrieving/post-processing anything that can be pre-computed. If you have been doing any amount of reading, you will likely have heard that in SQL you model your data; in Cassandra (and most other NoSQL databases) you model your queries (sorry for ripping off whoever said this originally).
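>
> As a rough sketch of what 'model your queries' can look like for call data, the write path might store each call's duration into one row per hour, with a TimeUUID column per call. This uses pycassa purely for illustration, and the keyspace/column family names are made up:
>
>     import time
>     import pycassa
>     from pycassa.util import convert_time_to_uuid
>
>     # Illustrative names only: a 'Metrics' keyspace with a
>     # 'CallDurationByHour' column family whose comparator is TimeUUIDType.
>     conn_pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
>     durations = pycassa.ColumnFamily(conn_pool, 'CallDurationByHour')
>
>     def record_call(call_start_ts, duration_seconds):
>         # Row key = hour bucket, column name = TimeUUID of the call start,
>         # column value = the duration. One row holds one hour of calls,
>         # already laid out for an 'average duration per hour' report.
>         hour_bucket = time.strftime('%Y%m%d%H', time.gmtime(call_start_ts))
>         column = convert_time_to_uuid(call_start_ts)
>         durations.insert(hour_bucket, {column: str(duration_seconds)})
>
> The point is that the data is written once, in the shape the report needs, rather than being reassembled by a query later.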
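>
> The read/aggregation side is then just ordinary client code over column slices; nothing here is Cassandra magic, it is a plain fan-out across rows (again assuming the same made-up pycassa keyspace and column family as above):
>
>     from multiprocessing.dummy import Pool  # thread pool; the work is I/O bound
>
>     import pycassa
>
>     conn_pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
>     durations = pycassa.ColumnFamily(conn_pool, 'CallDurationByHour')
>
>     def hour_stats(hour_bucket):
>         # Pull one hour's worth of durations as a single column slice and
>         # reduce it client side; Cassandra only hands back the columns.
>         cols = durations.get(hour_bucket, column_count=100000)
>         values = [float(v) for v in cols.values()]
>         if not values:
>             return 0, 0.0, 0.0
>         n = len(values)
>         mean = sum(values) / n
>         stddev = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
>         return n, mean, stddev
>
>     def daily_average(hour_buckets):
>         # Parallelize across rows so the per-row round trips overlap.
>         workers = Pool(10)
>         per_hour = workers.map(hour_stats, hour_buckets)
>         workers.close()
>         total = sum(n for n, _, _ in per_hour)
>         return sum(n * mean for n, mean, _ in per_hour) / total
>
> This kind of periodic job is what I mean by implementing the averages manually; its results can themselves be written back to another column family so that the final report is a single read.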
> If there is one thing/concept I can say I have learned about Cassandra, it is this: pre-compute (or asynchronously compute) anything you possibly can, and don't be afraid to write a ridiculous amount to the Cassandra database. In terms of ad-hoc aggregation, there is no nice, simple scripting language for Cassandra data processing (e.g. SQL). That said, you can do most things pretty quickly with a bit of code. Consider that loading a few hundred to a few thousand records (< 3k) can be pretty quick (< 100 ms, often < 10 ms, particularly if they are cached). Our organization basically uses the following approach: 'use Cassandra for generating continuous, 10-second-accuracy time-series reports, but MySQL and a production DB replica for any ad-hoc, single-value report the boss wants NOW'.
>
> Based on what you have described, it sounds like you are thinking about your problem from a SQL-like point of view: store data once, then query/filter/aggregate it in multiple different ways to obtain useful information. If possible, try to leverage the power of Cassandra and store the data in efficient, per-query pre-optimized forms. For example, I can imagine the average call duration being an important parameter in a system analyzing call data records. Instead of storing all the information about a call in one place, store the 'call duration' in a separate column family, each row holding the call durations for a given hour as single-integer columns (the column name being the TimeUUID). My metrics system does something similar to this and loads batches of 15,000 records (a column slice) in < 200 ms. By parallelizing across 10 threads loading from different rows, I can process the average, standard deviation, and a factor roughly meaning 'how close to Gaussian' for 1 million records in < 5 seconds.
>
> To reiterate, Cassandra is not the solution if you are looking for 'Database: I command thee to give me the average of field x.' That said, I have found its overall data-processing capabilities to be reasonably impressive.
>
> Dan
>
> -----Original Message-----
> From: Arun Cherian [mailto:archer...@gmail.com]
> Sent: December-10-10 16:43
> To: user@cassandra.apache.org
> Subject: Cassandra for Ad-hoc Aggregation and formula calculation
>
> Hi,
>
> I have been reading up on Cassandra for the past few weeks and I am highly impressed by the features it offers. At work, we are starting work on a product that will handle several million CDRs (Call Data Records, each of which can basically be thought of as a .CSV file) per day. We will have to store the data, and perform aggregations and calculations on them. A few veteran RDBMS admin friends (we are a small .NET shop, we don't have any in-house DB talent) recommended Infobright and NoSQL to us, and hence my search. I was wondering if Cassandra is a good fit for:
>
> 1. Storing several million data records per day (each record will be a few KB in size) without any data loss.
> 2. Aggregation of certain fields in the stored records, like Avg across time period.
> 3. Using certain existing fields to calculate new values on the fly and store it too.
> 4. We were wondering if pre-aggregation was a good choice (calculating aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we need ad-hoc aggregation, does Cassandra support that over this amount of data?
>
> Thanks,
> Arun