We are currently working on a prototype that uses Cassandra for a realtime-ish statistics system. This seems to be quite a common use case, so if people are interested, maybe it would be worth collaborating on this beyond design discussions on the list. But first let me explain our approach and where we could use some input.
We are storing the raw events in minute buckets:

<Minute Timestamp> => {
  <UUID> => { 'id' => 1, 'attrA' => 'a1', 'attrB' => 'b1' },
  <UUID> => { 'id' => 2, 'attrA' => 'a2', 'attrB' => 'b1' },
  ...
}

The number of attributes is quite limited at the moment (below 20), and for now we plan for no more than 1000 events per minute, so this should really be a piece of cake for Cassandra. With this little data, using a super column should be no problem.

The idea is then to iterate over the minute buckets and build hour, day, month and year aggregates. With that, getting the totals across a certain time frame is no more than a few gets (or a multiget) plus summing it all up. I guess the idea is straightforward. (There are rough pycassa sketches of both variants below in the PS.)

One could use a super column to store and access the aggregated data from the time buckets:

<Hour Timestamp> => {
  'id/1' => { 'count' => 12 },
  'id/2' => { 'count' => 21 },
  ...
}

While this feels natural, the hierarchy might not be the best choice with the current Cassandra if the number of different ids becomes too large, IIUC. One could instead move the id part into the row key space:

<Hour Timestamp> + 'id/1' => 12
<Hour Timestamp> + 'id/2' => 21
...

...at least as long as we don't have to access all data for one time slot (like one hour in this case). Even that should still be possible with a row key range query, though, as long as the ordered partitioner is being used.

Q: Is the only difference the limitation on row size? What performance considerations weigh in favor of one approach or the other? Does Cassandra first have to load the whole row into memory before one can access e.g. 'id/1' with the super column approach?

cheers
--
Torsten
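PS: In case a concrete sketch helps the discussion, here is roughly how the write path and the hour rollup could look from Python with pycassa. This is just an illustration of the idea, not a tested implementation: the keyspace and column family names ('Stats', 'MinuteEvents', 'HourAggregates') and the key format are made up, and values are stored as strings since we haven't defined any validators.

import time
import uuid
from collections import defaultdict

import pycassa

pool = pycassa.ConnectionPool('Stats')                 # keyspace name assumed
events = pycassa.ColumnFamily(pool, 'MinuteEvents')    # super CF: minute ts -> { uuid -> attrs }
hourly = pycassa.ColumnFamily(pool, 'HourAggregates')  # super CF: hour ts -> { 'id/n' -> {'count': ...} }

def minute_key(ts):
    # Truncate a unix timestamp to the start of its minute.
    return str(int(ts) - int(ts) % 60)

def record_event(attrs):
    # One event = one super column keyed by a random UUID;
    # attrs is a dict of strings like {'id': '1', 'attrA': 'a1', 'attrB': 'b1'}.
    events.insert(minute_key(time.time()), {str(uuid.uuid4()): attrs})

def rollup_hour(hour_ts):
    # Walk the 60 minute buckets of the hour and count events per id.
    counts = defaultdict(int)
    for minute in range(hour_ts, hour_ts + 3600, 60):
        try:
            bucket = events.get(str(minute))
        except pycassa.NotFoundException:
            continue  # no events in this minute
        for event in bucket.values():
            counts['id/%s' % event['id']] += 1
    # <Hour Timestamp> => { 'id/1' => {'count': '12'}, 'id/2' => {'count': '21'}, ... }
    hourly.insert(str(hour_ts),
                  dict((k, {'count': str(v)}) for k, v in counts.items()))

rollup_hour() would run once per hour (or be re-run for late data), and the day/month/year aggregates would be built the same way from the hour rows instead of the minute buckets.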
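PPS: And reading back one hour from the row-key variant via a key range scan (only meaningful with the ordered partitioner) might look like the sketch below. The '<hour_ts>/id/<n>' key format, the 'HourCounters' CF and the single 'count' column per row are again assumptions on my part, not something we have settled on.

def hour_totals(hour_ts):
    # Row keys are assumed to look like '<hour_ts>/id/<n>'; with the
    # OrderedPartitioner they sort together, so one key range scan
    # returns every id for the hour. '~' sorts after the digits and
    # letters used in our ids, so it closes the prefix range.
    counters = pycassa.ColumnFamily(pool, 'HourCounters')  # standard CF, one row per (hour, id)
    start, finish = '%d/' % hour_ts, '%d/~' % hour_ts
    return dict((key, int(cols['count']))
                for key, cols in counters.get_range(start=start, finish=finish))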