Nice email Dan. I would also add that if you are still in the initial stages, take a look at Hadoop+Pig. If your source data is write-once, read-many it may be a better fit, but then you would also need to calculate the aggregates and store them somewhere.
So Cassandra *may* be just what you want. The ability to keep large amounts of data online with high performance, and to remove other servers from your stack, is a definite plus.

Aaron

On 11/12/2010, at 7:01 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:

> Perhaps other, more experienced and reputable contributors to this list can comment, but to be frank: Cassandra is probably not for you (at least for now). I personally feel Cassandra is one of the stronger NoSQL options out there and has the potential to become the de facto standard, but it's not quite there yet and does not inherently meet your requirements.
>
> To give you some background, I started experimenting with Cassandra as a personal project to avoid having to look through days' worth of server logs (and because I thought it was cool). The project ballooned and has become my organization's primary metrics and analytics platform, which currently processes 200 million+ events/records per day. I doubt any traditional database solution could have performed as well as Cassandra, but the development and operations process has not been without severe growing pains.
>
>> 1. Storing several million data records per day (each record will be a few KB in size) without any data loss.
>
> Absolutely, no problems on this front. A cluster of moderately beefy servers will handle this with no complaints. As long as you are careful to avoid hotspots in your data distribution, Cassandra truly is damn near linearly scalable with hardware.
>
>> 2. Aggregation of certain fields in the stored records, like Avg across time period.
>
> Cassandra cannot do this on its own (by design and for good reason). There have been efforts to add support for higher-level data processing languages (such as Pig and Hive) but they are not 'out of the box' solutions and, in my experience, difficult to get working properly. I ended up writing my own data processing/report generation framework that works ridiculously well for my particular case. In relation to your requirements, calculating averages across fields would probably have to be implemented manually (and executed as a periodic, automated task). Although non-trivial, this isn't quite as bad as you might think.
>
>> 3. Using certain existing fields to calculate new values on the fly and store it too.
>
> Not quite sure what you are asking here. To go back to the last point: to calculate anything new, you are probably going to have to load all the records on which that calculation depends into a separate process/server. Generally, I would say Cassandra isn't particularly good at 'on the fly' data aggregation tasks (certainly not at all to the extent an SQL database is). To be fair, that's also not what it is explicitly designed for or advertised to do well.
>
>> 4. We were wondering if pre-aggregation was a good choice (calculating aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we need ad-hoc aggregation, does Cassandra support that over this amount of data?
>
> Cassandra is GREAT for accessing/storing/retrieving/post-processing anything that can be pre-computed. If you have been doing any amount of reading, you will likely have heard that in SQL you model your data; in Cassandra (and most other NoSQL databases) you model your queries (sorry for ripping off whoever said this originally).
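>
> As a rough sketch of what 'model your queries' can look like for call data, the write path might store each call's duration into one row per hour, with a TimeUUID column per call. This uses pycassa purely for illustration, and the keyspace/column family names are made up:
>
>     import time
>     import pycassa
>     from pycassa.util import convert_time_to_uuid
>
>     # Illustrative names only: a 'Metrics' keyspace with a
>     # 'CallDurationByHour' column family whose comparator is TimeUUIDType.
>     conn_pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
>     durations = pycassa.ColumnFamily(conn_pool, 'CallDurationByHour')
>
>     def record_call(call_start_ts, duration_seconds):
>         # Row key = hour bucket, column name = TimeUUID of the call start,
>         # column value = the duration. One row holds one hour of calls,
>         # already laid out for an 'average duration per hour' report.
>         hour_bucket = time.strftime('%Y%m%d%H', time.gmtime(call_start_ts))
>         column = convert_time_to_uuid(call_start_ts)
>         durations.insert(hour_bucket, {column: str(duration_seconds)})
>
> The point is that the data is written once, in the shape the report needs, rather than being reassembled by a query later.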
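>
> The read/aggregation side is then just ordinary client code over column slices; nothing here is Cassandra magic, it is a plain fan-out across rows (again assuming the same made-up pycassa keyspace and column family as above):
>
>     from multiprocessing.dummy import Pool  # thread pool; the work is I/O bound
>
>     import pycassa
>
>     conn_pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
>     durations = pycassa.ColumnFamily(conn_pool, 'CallDurationByHour')
>
>     def hour_stats(hour_bucket):
>         # Pull one hour's worth of durations as a single column slice and
>         # reduce it client side; Cassandra only hands back the columns.
>         cols = durations.get(hour_bucket, column_count=100000)
>         values = [float(v) for v in cols.values()]
>         if not values:
>             return 0, 0.0, 0.0
>         n = len(values)
>         mean = sum(values) / n
>         stddev = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
>         return n, mean, stddev
>
>     def daily_average(hour_buckets):
>         # Parallelize across rows so the per-row round trips overlap.
>         workers = Pool(10)
>         per_hour = workers.map(hour_stats, hour_buckets)
>         workers.close()
>         total = sum(n for n, _, _ in per_hour)
>         return sum(n * mean for n, mean, _ in per_hour) / total
>
> This kind of periodic job is what I mean by implementing the averages manually; its results can themselves be written back to another column family so that the final report is a single read.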
> If there is one thing/concept I can say I have learned about Cassandra, it is this: pre-compute (or asynchronously compute) anything you possibly can, and don't be afraid to write a ridiculous amount to the Cassandra database. In terms of ad-hoc aggregation, there is no nice, simple scripting language for Cassandra data processing (e.g. SQL). That said, you can do most things pretty quickly with a bit of code. Consider that loading a few hundred to a few thousand records (< 3k) can be pretty quick (< 100 ms, often < 10 ms, particularly if they are cached). Our organization basically uses the following approach: 'use Cassandra for generating continuous, 10-second-accuracy time-series reports, but MySQL and a production DB replica for any ad-hoc, single-value report the boss wants NOW'.
>
> Based on what you have described, it sounds like you are thinking about your problem from a SQL-like point of view: store data once, then query/filter/aggregate it in multiple different ways to obtain useful information. If possible, try to leverage the power of Cassandra and store the data in efficient, per-query pre-optimized forms. For example, I can imagine the average call duration being an important parameter in a system analyzing call data records. Instead of storing all the information about a call in one place, store the 'call duration' in a separate column family, each row holding the call durations for a given hour as single-integer columns (the column name being the TimeUUID). My metrics system does something similar to this and loads batches of 15,000 records (a column slice) in < 200 ms. By parallelizing across 10 threads loading from different rows, I can process the average, standard deviation, and a factor roughly meaning 'how close to Gaussian' for 1 million records in < 5 seconds.
>
> To reiterate, Cassandra is not the solution if you are looking for 'Database: I command thee to give me the average of field x.' That said, I have found its overall data-processing capabilities to be reasonably impressive.
>
> Dan
>
> -----Original Message-----
> From: Arun Cherian [mailto:archer...@gmail.com]
> Sent: December-10-10 16:43
> To: user@cassandra.apache.org
> Subject: Cassandra for Ad-hoc Aggregation and formula calculation
>
> Hi,
>
> I have been reading up on Cassandra for the past few weeks and I am highly impressed by the features it offers. At work, we are starting work on a product that will handle several million CDRs (Call Data Records, each of which can basically be thought of as a .CSV file) per day. We will have to store the data, and perform aggregations and calculations on them. A few veteran RDBMS admin friends (we are a small .NET shop, we don't have any in-house DB talent) recommended Infobright and NoSQL to us, and hence my search. I was wondering if Cassandra is a good fit for:
>
> 1. Storing several million data records per day (each record will be a few KB in size) without any data loss.
> 2. Aggregation of certain fields in the stored records, like Avg across time period.
> 3. Using certain existing fields to calculate new values on the fly and store it too.
> 4. We were wondering if pre-aggregation was a good choice (calculating aggregation per 1 min, 5 min, 15 min etc ahead of time) but in case we need ad-hoc aggregation, does Cassandra support that over this amount of data?
>
> Thanks,
> Arun