On Wed, May 19, 2010 at 7:15 PM, Torsten Curdt <tcu...@vafer.org> wrote:

> We are currently working on a prototype that is using Cassandra for
> realtime-ish statistics system. This seems to be quite a common use
> case. If people are interested - maybe it would be worth collaborating
> on this beyond design discussions on the list. But first let me explain
> our approach and where we could use some input.
>
> We are storing the raw events into minute buckets.
>
>    <Minute Timestamp> => {
>        <UUID> => { 'id'=>1, 'attrA' => 'a1', 'attrB' => 'b1' },
>        <UUID> => { 'id'=>2, 'attrA' => 'a2', 'attrB' => 'b1' }
>        ...
>    }
>
> The number of attributes is quite limited currently (below 20) and
> for now we only plan to have no more than 1000 events per minute. So
> this should really be a piece of cake for Cassandra. With this little
> data using a super column should be no problem.
>

True, but with so few attributes you could serialize them to JSON or XML
and use a regular column family.  You don't have a way to access per-minute
attributes for a given id like this anyway, so there's really no reason to
use a supercolumn.
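For illustration, here is a minimal Python sketch of that flattening: a plain
dict stands in for the minute row, the column name is the event UUID, and the
value is the attributes serialized to one JSON string (the exact key layout is
an assumption, not something from the original post):

```python
import json
import uuid

# One row per minute bucket; each column name is an event UUID and the
# value is the event's attributes serialized to a single JSON string.
minute_row = {}

event = {'id': 1, 'attrA': 'a1', 'attrB': 'b1'}
column_name = str(uuid.uuid1())  # time-based UUID for the event
minute_row[column_name] = json.dumps(event, sort_keys=True)

# Reading an event back is a single column fetch plus a decode.
decoded = json.loads(minute_row[column_name])
```

This keeps everything in regular columns while still storing all attributes
per event.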


> Now the idea is to iterate over the minute buckets and build hour,
> day, month and year aggregates. With that getting the totals across a
> certain time frame isn't more than a few gets (or a multiget) and
> summing it all up. I guess the idea is straight forward.
>
> One could use a super column to store and access the aggregated data
> from the time buckets:
>
>    <Hour Timestamp> => {
>        'id/1' => { 'count' => 12 },
>        'id/2' => { 'count' => 21 }
>        ...
>    }
>

Do you just want to store overall counts, or counts per attribute?  If the
former, there's really no reason to use a supercolumn here.

> While this feels natural, the hierarchy might not be the best choice
> with the current Cassandra if the number of different ids becomes too
> large IIUC. One could also move the id part into the row key space
> instead.
>
>    <Hour Timestamp> + 'id/1' => 12
>    <Hour Timestamp> + 'id/2' => 21
>

If you're going to store per-attribute counts, this is the better approach
imo.
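To make the key composition concrete, a small sketch (the key format and the
`hour_bucket_key` helper are my own illustration; any unambiguous delimiter
works):

```python
from datetime import datetime

def hour_bucket_key(ts, entity_id):
    """Compose a row key like '2010051918/id/1' from an hour timestamp
    and an id.  The format here is an assumption, not a convention."""
    return '%s/id/%s' % (ts.strftime('%Y%m%d%H'), entity_id)

# A dict stands in for the column family.
counts = {}
ts = datetime(2010, 5, 19, 18)
counts[hour_bucket_key(ts, 1)] = 12
counts[hour_bucket_key(ts, 2)] = 21

# Totals across a time frame are then a multiget over composed keys
# plus a sum, as described above.
total = sum(counts[hour_bucket_key(ts, i)] for i in (1, 2))
```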


> ...at least as long as we don't have to access all data for one time
> slot (like one hour in this case). (This should still be possible with
> a row key range query though ...if the ordered partitioner is being
> used)
>

You don't have to use the OPP if you maintain an index of the ids keyed by
hour timestamp.
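A sketch of that index, with dicts standing in for column families (the row
and column layout here is an assumption for illustration):

```python
# With the random partitioner you can't range-scan row keys, so keep one
# index row per hour whose column names are the ids seen in that hour.
index = {}  # hour timestamp -> set of ids
data = {}   # (hour timestamp, id) -> count

def record(hour, entity_id, count):
    index.setdefault(hour, set()).add(entity_id)
    data[(hour, entity_id)] = count

record('2010051918', 'id/1', 12)
record('2010051918', 'id/2', 21)

# To read everything for an hour: one get on the index row, then a
# multiget on the composed keys -- no ordered partitioner required.
ids = index['2010051918']
hour_counts = dict((i, data[('2010051918', i)]) for i in ids)
```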

> Q: Is the only difference the limitation from the row size? What are
> the performance considerations weighing in for one or the other
> approach.


Supercolumns are generally not as fast as regular columns.


> Does Cassandra first have to load the whole row into memory
> before one can access e.g. "id/1" with the super column approach?
>

No, but it does have to load all the subcolumns into memory because they
aren't indexed.

I've built a system very similar to this without supercolumns, and my
approach was to insert minute data into rows keyed by <minute timestamp +
id> where the columns were UUIDs and the values were serialized json of all
the attributes, while also maintaining an index of the ids keyed by an hour
timestamp.  Then I would build hourly totals keyed by <hour timestamp+id>
where each column was an attribute, so I could slice a single attribute from
many ids easily.  From there aggregating at the daily/weekly/monthly level
is relatively easy.
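The minute-to-hour rollup step can be sketched like this (dicts stand in for
column families; key formats and names are illustrative assumptions, not the
exact system described above):

```python
import json
from collections import defaultdict

# Minute rows keyed by <minute timestamp>+<id>; column names are event
# UUIDs and values are the JSON-serialized attributes of each event.
minute_rows = {
    ('201005191800', 'id/1'): {
        'uuid-a': json.dumps({'attrA': 'a1', 'attrB': 'b1'}),
        'uuid-b': json.dumps({'attrA': 'a2', 'attrB': 'b1'}),
    },
}

# Roll minutes up into hourly rows keyed by <hour timestamp>+<id> with
# one column per attribute value, so a single attribute can be sliced
# across many ids.
hourly = defaultdict(lambda: defaultdict(int))
for (minute, entity_id), cols in minute_rows.items():
    hour = minute[:10]  # '201005191800' -> '2010051918'
    for raw in cols.values():
        for attr, value in json.loads(raw).items():
            hourly[(hour, entity_id)]['%s=%s' % (attr, value)] += 1
```

From these hourly rows the daily/weekly/monthly rollups follow the same
pattern, just with coarser key prefixes.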

-Brandon
