RE: Cassandra for Ad-hoc Aggregation and formula calculation

Dan Hendry Fri, 10 Dec 2010 22:02:15 -0800

Perhaps other, more experienced and reputable contributors to this list can 
comment but to be frank: Cassandra is probably not for you (at least for now). 
I personally feel Cassandra is one of the stronger NoSQL options out there and 
has the potential to become the de facto standard, but it’s not quite there yet 
and does not inherently meet your requirements.


To give you some background, I started experimenting with Cassandra as a 
personal project to avoid having to look through days’ worth of server logs (and 
because I thought it was cool). The project ballooned and has become my 
organization’s primary metrics and analytics platform, which currently processes 
200 million+ events/records per day. I doubt any traditional database solution 
could have performed as well as Cassandra but the development and operations 
process has not been without severe growing pains. 

> 1. Storing several million data records per day (each record will be a
> few KB in size) without any data loss.

Absolutely, no problems on this front. A cluster of moderately beefy servers 
will handle this with no complaints. As long as you are careful to avoid 
hotspots in your data distribution, Cassandra truly is damn near linearly 
scalable with hardware. 

> 2. Aggregation of certain fields in the stored records, like Avg
> across time period.

Cassandra cannot do this on its own (by design and for good reason). There have 
been efforts to add support for higher level data processing languages (such as 
Pig and Hive) but they are not 'out of the box' solutions and, in my experience, 
are difficult to get working properly. I ended up writing my own data 
processing/report generation framework that works ridiculously well for my 
particular case. In relation to your requirements, calculating averages across 
fields would probably have to be implemented manually (and executed as a 
periodic, automated task). Although non-trivial, this isn’t quite as bad as you 
might think.
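
A minimal sketch of what such a periodic averaging job might compute, in 
Python. The record layout (timestamp, duration pairs) and the function name are 
assumptions made up for illustration; fetching the records from Cassandra and 
writing the results back are left out entirely.

```python
from collections import defaultdict


def average_by_period(records, period_seconds=3600):
    """Average a numeric field per fixed-size time bucket.

    records: iterable of (unix_timestamp, value) pairs, assumed to have
    already been loaded from the store by code not shown here.
    Returns {bucket_start_timestamp: average_value}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in records:
        # Round the timestamp down to the start of its bucket.
        bucket = int(timestamp) - int(timestamp) % period_seconds
        sums[bucket] += value
        counts[bucket] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sums}
```

Run from a scheduler over the records of the last period, the resulting 
averages could then be written back as their own rows for cheap retrieval.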

> 3. Using certain existing fields to calculate new values on the fly
> and store it too.

Not quite sure what you are asking here. To go back to the last point: to 
calculate anything new, you are probably going to have to load all the records 
on which that calculation depends into a separate process/server. Generally, I 
would say Cassandra isn’t particularly good at 'on the fly' data aggregation 
tasks (certainly not to the extent an SQL database is). To be fair, 
that’s also not what it is explicitly designed for or advertised to do well. 

> 4. We were wondering if pre-aggregation was a good choice (calculating
> aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
> need ad-hoc aggregation, does Cassandra support that over this amount
> of data?

Cassandra is GREAT for accessing/storing/retrieving/post-processing anything 
that can be pre-computed. If you have been doing any amount of reading, you 
will likely have heard that in SQL you model data, in Cassandra (and most other 
NoSQL databases) you model your queries (sorry for ripping off whoever said 
this originally). If there is one thing I can say I have learned about 
Cassandra, it is this: pre-compute (or asynchronously compute) anything you 
possibly can, and don’t be afraid to write a ridiculous amount to the Cassandra 
database. In terms of ad-hoc aggregation, there is no nice, simple scripting 
language (like SQL) for Cassandra data processing. That said, you can do most 
things pretty quickly with a bit of code. Consider that loading a few hundred to 
a few thousand records (< 3k) can be pretty quick (< 100 ms, often < 10 ms, 
particularly if they are cached). Our organization basically uses the following 
approach: 'use Cassandra for generating continuous 10 second accuracy time 
series reports but MySQL and a production DB replica for any ad-hoc single 
value report the boss wants NOW'.
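
To make the pre-aggregation idea from your question concrete, here is a 
minimal Python sketch that updates 1, 5 and 15 minute rollups as each value 
arrives. The class and method names are hypothetical, and the in-memory dict 
is only a stand-in for whatever rows you would actually write to Cassandra.

```python
from collections import defaultdict

WINDOWS = (60, 300, 900)  # 1 min, 5 min, 15 min, in seconds


class Rollups:
    """Keeps a running (count, total) per window size and bucket start."""

    def __init__(self):
        # (window_seconds, bucket_start) -> [count, total]
        self.cells = defaultdict(lambda: [0, 0.0])

    def record(self, timestamp, value):
        # Every incoming value updates one bucket per window size.
        for window in WINDOWS:
            start = int(timestamp) - int(timestamp) % window
            cell = self.cells[(window, start)]
            cell[0] += 1
            cell[1] += value

    def average(self, window, timestamp):
        # Returns None when nothing was recorded in that bucket.
        start = int(timestamp) - int(timestamp) % window
        count, total = self.cells.get((window, start), (0, 0.0))
        return total / count if count else None
```

Storing count and total rather than the average itself means buckets can be 
combined later into a correct average over a longer period.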


Based on what you have described, it sounds like you are thinking about your 
problem from a SQL-like point of view: store data once then 
query/filter/aggregate it in multiple different ways to obtain useful 
information. If possible, try to leverage the power of Cassandra and store it in 
efficient, per-query pre-optimized forms. For example, I can imagine the 
average call duration being an important parameter in a system analyzing call 
data records. Instead of storing all the information about a call in one place, 
store the 'call duration' in a separate column family, each row holding the 
call durations for a given hour, one integer per column (column name being 
the TimeUUID). My metrics system does something similar to this and loads 
batches of 15,000 records (column slice) in < 200 ms. By parallelizing across 
10 threads loading from different rows, I can process the average, standard 
deviation and a factor roughly meaning 'how close to Gaussian' for 1 million 
records in < 5 seconds. 
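
A rough Python sketch of the parallel approach, assuming a caller-supplied 
load_row function (hypothetical) that returns the list of durations stored in 
one row. This is not my actual framework: it merges per-row partial results 
with the standard pairwise mean/variance update and does not attempt the 
'how close to Gaussian' factor.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partial_stats(values):
    """Return (count, mean, M2) for one batch; M2 is the sum of squared deviations."""
    n = len(values)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return (n, mean, m2)


def merge(a, b):
    """Combine two (count, mean, M2) triples into one for the union of both batches."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


def mean_and_stddev(row_keys, load_row, workers=10):
    """Load rows on a thread pool and return (mean, population std dev)."""
    total = (0, 0.0, 0.0)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker loads one row and reduces it to a small triple.
        partials = pool.map(lambda key: partial_stats(load_row(key)), row_keys)
        for part in partials:
            total = merge(total, part)
    n, mean, m2 = total
    stddev = math.sqrt(m2 / n) if n else 0.0
    return mean, stddev
```

In Python the threads mainly help by overlapping the waits on row loads; the 
arithmetic itself is cheap by comparison.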

To reiterate, Cassandra is not the solution if you are looking for 'Database: I 
command thee to give me the average of field x.' That said, I have found its 
overall data-processing capabilities to be reasonably impressive.

Dan

-----Original Message-----
From: Arun Cherian [mailto:[email protected]] 
Sent: December-10-10 16:43
To: [email protected]
Subject: Cassandra for Ad-hoc Aggregation and formula calculation

Hi,

I have been reading up on Cassandra for the past few weeks and I am
highly impressed by the features it offers. At work, we are starting
work on a product that will handle several million CDR (Call Data
Record, basically can be thought of as a .CSV file) per day. We will
have to store the data, and perform aggregations and calculations on
them. A few veteran RDBMS admin friends (we are a small .NET shop, we
don't have any in-house DB talent) recommended Infobright and noSQL to
us, and hence my search. I was wondering if Cassandra is a good fit
for

1. Storing several million data records per day (each record will be a
few KB in size) without any data loss.
2. Aggregation of certain fields in the stored records, like Avg
across time period.
3. Using certain existing fields to calculate new values on the fly
and store it too.
4. We were wondering if pre-aggregation was a good choice (calculating
aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we
need ad-hoc aggregation, does Cassandra support that over this amount
of data?

Thanks,
Arun
