We are currently working on a prototype that uses Cassandra for a
realtime-ish statistics system. This seems to be quite a common use
case. If people are interested, maybe it would be worth collaborating
on this beyond design discussions on the list. But first let me explain
our approach and where we could use some input.

We are storing the raw events into minute buckets.

    <Minute Timestamp> => {
        <UUID> => { 'id'=>1, 'attrA' => 'a1', 'attrB' => 'b1' },
        <UUID> => { 'id'=>2, 'attrA' => 'a2', 'attrB' => 'b1' }
    }

The number of attributes is quite limited currently (below 20), and
for now we plan to have no more than 1000 events per minute. So
this should really be a piece of cake for Cassandra. With this little
data, using a super column should be no problem.
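
To make the minute-bucket layout concrete, here is a minimal sketch in
plain Python. It uses a nested dict as a stand-in for the super column
family and does not talk to Cassandra at all. The names minute_bucket,
store_event and minute_rows, as well as the string format of the bucket
key, are assumptions made up for this illustration.

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone

def minute_bucket(ts: datetime) -> str:
    """Truncate a timestamp to the minute; the result serves as the row key."""
    return ts.replace(second=0, microsecond=0).strftime("%Y-%m-%dT%H:%M")

# In-memory stand-in for the super column family (assumption):
# row key (minute) -> super column name (UUID) -> attribute columns.
minute_rows = defaultdict(dict)

def store_event(ts: datetime, event: dict):
    """Store one raw event under its minute bucket, keyed by a time-based UUID."""
    key = minute_bucket(ts)
    event_id = uuid.uuid1()
    minute_rows[key][event_id] = dict(event)
    return key, event_id

# Two events in the same minute, one in the next minute.
t0 = datetime(2010, 6, 1, 12, 34, 5, tzinfo=timezone.utc)
t1 = datetime(2010, 6, 1, 12, 34, 50, tzinfo=timezone.utc)
t2 = datetime(2010, 6, 1, 12, 35, 1, tzinfo=timezone.utc)
store_event(t0, {"id": 1, "attrA": "a1", "attrB": "b1"})
store_event(t1, {"id": 2, "attrA": "a2", "attrB": "b1"})
store_event(t2, {"id": 1, "attrA": "a1", "attrB": "b1"})
```

With at most 1000 events per minute, each row here stays small, which is
the reason the super column layout looks unproblematic for the raw data.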

Now the idea is to iterate over the minute buckets and build hour,
day, month and year aggregates. With that, getting the totals across a
certain time frame takes no more than a few gets (or a multiget) and
summing it all up. I guess the idea is straightforward.
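
The roll-up described above could look roughly like the following
sketch. It is again plain Python over dicts rather than real Cassandra
calls. The function names (hour_of, roll_up_to_hours, total) and the
"YYYY-MM-DDTHH:MM" key format are assumptions for illustration only, and
only the minute-to-hour step is shown; day, month and year aggregates
would follow the same pattern.

```python
from collections import Counter, defaultdict

def hour_of(minute_key: str) -> str:
    """Derive the hour bucket key, e.g. '2010-06-01T12:34' -> '2010-06-01T12'."""
    return minute_key[:13]

def roll_up_to_hours(minute_rows: dict) -> dict:
    """Count events per id for each hour by iterating over the minute buckets."""
    hourly = defaultdict(Counter)
    for minute_key, events in minute_rows.items():
        for event in events.values():
            hourly[hour_of(minute_key)][event["id"]] += 1
    return hourly

def total(hourly: dict, hour_keys: list, event_id) -> int:
    """Sum the counts of one id over a list of hour buckets (the 'multiget')."""
    return sum(hourly[h][event_id] for h in hour_keys if h in hourly)

# Hypothetical sample data; the UUID column names are shortened to strings.
sample_minutes = {
    "2010-06-01T12:01": {"u1": {"id": 1}, "u2": {"id": 2}},
    "2010-06-01T12:59": {"u3": {"id": 1}},
    "2010-06-01T13:00": {"u4": {"id": 1}},
}
sample_hourly = roll_up_to_hours(sample_minutes)
sample_total = total(sample_hourly, ["2010-06-01T12", "2010-06-01T13"], 1)
```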

One could use a super column to store and access the aggregated data
from the time buckets:

    <Hour Timestamp> => {
        'id/1' => { 'count' => 12 },
        'id/2' => { 'count' => 21 }
    }

While this feels natural, the hierarchy might not be the best choice
with the current Cassandra if the number of different ids becomes too
large, IIUC. One could also move the id part into the row key space:

    <Hour Timestamp> + 'id/1' => 12
    <Hour Timestamp> + 'id/2' => 21

...at least as long as we don't have to access all data for one time
slot (like one hour in this case). (This should still be possible with
a row key range query though ...if the ordered partitioner is being
used.)
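
As a rough illustration of the row-key variant, the sketch below builds
composite keys and emulates a row key range query over them. A sorted
list of keys stands in for what an order-preserving partitioner would
provide; this is not Cassandra client code. The "|" separator and the
names row_key and hour_slice are assumptions invented for this example.

```python
from bisect import bisect_left

def row_key(hour: str, event_id: int) -> str:
    """Composite row key: <Hour Timestamp> + 'id/<n>' ('|' separator is an assumption)."""
    return f"{hour}|id/{event_id}"

def hour_slice(rows: dict, hour: str) -> dict:
    """Return all rows of one hour, emulating a key range query on sorted keys."""
    keys = sorted(rows)           # keys are assumed to be stored in sorted order
    start = f"{hour}|"            # smallest possible key for this hour
    end = f"{hour}|\uffff"        # sorts after every 'id/...' suffix of this hour
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, end)
    return {k: rows[k] for k in keys[lo:hi]}

# Hypothetical aggregated counts, one row per (hour, id) pair.
sample_rows = {
    row_key("2010-06-01T12", 1): 12,
    row_key("2010-06-01T12", 2): 21,
    row_key("2010-06-01T13", 1): 5,
}
noon = hour_slice(sample_rows, "2010-06-01T12")
```

A single get by full key fetches one counter directly, while the slice
is only available if the keys are stored in order.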

Q: Is the only difference the limitation on the row size? What are
the performance considerations weighing in for one or the other
approach? Does Cassandra first have to load the whole row into memory
before one can access e.g. "id/1" with the super column approach?

