On Wed, May 19, 2010 at 7:15 PM, Torsten Curdt <tcu...@vafer.org> wrote:
> We are currently working on a prototype that is using Cassandra for a
> realtime-ish statistics system. This seems to be quite a common use
> case. If people are interested - maybe it would be worth collaborating
> on this beyond design discussions on the list. But first let me
> explain our approach and where we could use some input.
>
> We are storing the raw events into minute buckets.
>
> <Minute Timestamp> => {
>   <UUID> => { 'id'=>1, 'attrA' => 'a1', 'attrB' => 'b1' },
>   <UUID> => { 'id'=>2, 'attrA' => 'a2', 'attrB' => 'b1' }
>   ...
> }
>
> The number of attributes is quite limited currently (below 20) and
> for now we only plan to have no more than 1000 events per minute. So
> this should really be a piece of cake for Cassandra. With this little
> data, using a super column should be no problem.

True, but with so few attributes you could serialize them to JSON or
XML and use a regular column family. You don't have a way to access
per-minute attributes for a given id like this anyway, so there's
really no reason to use a supercolumn.

> Now the idea is to iterate over the minute buckets and build hour,
> day, month and year aggregates. With that, getting the totals across
> a certain time frame is no more than a few gets (or a multiget) and
> summing it all up. I guess the idea is straightforward.
>
> One could use a super column to store and access the aggregated data
> from the time buckets:
>
> <Hour Timestamp> => {
>   'id/1' => { 'count' => 12 },
>   'id/2' => { 'count' => 21 }
>   ...
> }

Do you just want to store overall counts, or counts per attribute? If
the former, there's really no reason to use a supercolumn here.

> While this feels natural, the hierarchy might not be the best choice
> with the current Cassandra if the number of different ids becomes too
> large, IIUC. One could also move the id part into the row key space
> instead.
> <Hour Timestamp> + 'id/1' => 12
> <Hour Timestamp> + 'id/2' => 21

If you're going to store per-attribute counts, this is the better
approach imo.

> ...at least as long as we don't have to access all data for one time
> slot (like one hour in this case). (This should still be possible
> with a row key range query though ...if the ordered partitioner is
> being used)

You don't have to use the OPP if you maintain an index of the ids
keyed by hour timestamp.

> Q: Is the only difference the limitation from the row size? What are
> the performance considerations weighing in for one or the other
> approach?

Supercolumns are generally not as fast as regular columns.

> Does Cassandra first have to load the whole row into memory
> before one can access e.g. "id/1" with the super column approach?

No, but it does have to load all the subcolumns into memory because
they aren't indexed.

I've built a system very similar to this without supercolumns, and my
approach was to insert minute data into rows keyed by <minute
timestamp + id>, where the columns were UUIDs and the values were
serialized JSON of all the attributes, while also maintaining an index
of the ids keyed by an hour timestamp. Then I would build hourly
totals keyed by <hour timestamp + id>, where each column was an
attribute, so I could slice a single attribute from many ids easily.
From there, aggregating at the daily/weekly/monthly level is
relatively easy.

-Brandon
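To make the layout above concrete, here's a rough Python sketch that
simulates the three column families with plain dicts (no actual
Cassandra client involved). The names (minutes, hour_index,
hour_totals, record_event, rollup_hour) are made up for illustration;
in a real deployment each dict would be a column family accessed over
Thrift, and the rollup loop would be a multiget over minute rows.

```python
import json
import uuid
from collections import defaultdict

# Simulated column families as {row_key: {column_name: value}}.
# These dicts stand in for real Cassandra column families; the layout
# follows the scheme described above, not any particular client API.
minutes = defaultdict(dict)      # "<minute ts>:<id>" -> {uuid: json attrs}
hour_index = defaultdict(dict)   # <hour ts> -> {id: ''}  (index of ids seen)
hour_totals = {}                 # "<hour ts>:<id>" -> {attr: count}

def record_event(minute_ts, event_id, attrs):
    """Write a raw event into its minute row and index the id by hour."""
    row_key = "%d:%s" % (minute_ts, event_id)
    # Column name is a time-based UUID, value is the serialized attributes.
    minutes[row_key][str(uuid.uuid1())] = json.dumps(attrs)
    hour_ts = minute_ts - (minute_ts % 3600)
    hour_index[hour_ts][event_id] = ''   # column name is the id, value unused

def rollup_hour(hour_ts):
    """Aggregate the 60 minute buckets of an hour into per-attribute counts."""
    for event_id in hour_index[hour_ts]:
        totals = defaultdict(int)
        for minute_ts in range(hour_ts, hour_ts + 3600, 60):
            row = minutes.get("%d:%s" % (minute_ts, event_id), {})
            for raw in row.values():
                for attr, value in json.loads(raw).items():
                    # One column per attribute/value pair, e.g. "attrB=b1",
                    # so a single attribute can be sliced across many ids.
                    totals["%s=%s" % (attr, value)] += 1
        hour_totals["%d:%s" % (hour_ts, event_id)] = totals

# Usage: two events for id 1 within the same hour.
record_event(3600, 1, {'attrA': 'a1', 'attrB': 'b1'})
record_event(3660, 1, {'attrA': 'a2', 'attrB': 'b1'})
rollup_hour(3600)
print(hour_totals['3600:1']['attrB=b1'])  # 2
```

Daily/weekly/monthly rollups would follow the same pattern, reading
the hourly rows instead of the minute rows.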