In the spirit of making sure we have clear communication about our work, I'd like to outline the use cases Twitter has for distributed counters. I expect that many of you using Cassandra currently or in the future will have similar use cases.
The first use case is pretty simple: high scale counters. Our Tweet button [1] is powered by #1072 counters. We count every mention of every url that comes through a public tweet. As you would expect, there are a lot of urls and a lot of traffic to this widget (it's on many high-traffic sites, though it is heavily cached).

The second is a bit more complex: time series data. We have built infrastructure that can process logs (in real time, from scribe) or other events, convert them into a series of keys to increment, buffer the data for 1 minute, and then increment those keys (there's a rough sketch of the buffering at the bottom of this mail). For logs, each aggregator does its own increment (so per thing you're tracking you get an increment from each aggregator), but for events it'll be one increment per event. We plan to open source all of this soon, and we're hoping to start replacing our ganglia clusters with it. For the ganglia use case we end up with a large number of increments for every read.

For monitoring data, even a reasonably sized fleet with a moderate number of metrics can generate a huge amount of data. Imagine you have 500 machines (not how many we have) and measure 300 metrics per machine (a reasonable estimate based on our experience). Suppose you want to measure these things every minute and roll the values up every hour, day, month, and for all time. Suppose also that you are tracking sum, count, min, max, and sum of squares (so that you can do standard deviation), and that you want to track these metrics across groups like web hosts, databases, datacenters, etc. These basic assumptions mean this kind of traffic:

  (500 machines + 100 groups) * 300 metrics * 5 aggregates * 4 time granularities
    = 3,600,000 increments/minute

Read traffic, being employee-only, would be negligible compared to this.

One other use case: for many of the metrics we track, we want to track usage across several facets. For example [2], to build our local trends feature you could store a time series of terms per city. In this case supercolumns would be a natural fit, because the set of facets is unknown and open. Imagine a CF that has data like this:

  city0 =>
    hour0 => { term1 => 2, term2 => 1000, term3 => 1 },
    hour1 => { term5 => 2, term2 => 10 }
  city1 =>
    hour0 => { term12 => 3, term0 => 500, term3 => 1 },
    hour1 => { term5 => 2, term2 => 10 }

Of course, there are other ways to model this data. You could collapse the subcolumn names into the column names and redo how you slice (you have to slice anyway), though then you have to use fixed-width terms (there's a rough sketch of that packing at the bottom of this mail too):

  city0 => { hour0 + term1 => 2, hour0 + term2 => 1000, hour0 + term3 => 1,
             hour1 + term5 => 2, hour1 + term2 => 10 }
  city1 => { hour0 + term12 => 3, hour0 + term0 => 500, hour0 + term3 => 1,
             hour1 + term5 => 2, hour1 + term2 => 10 }

This is doable, but could be rough. The other option is to have a separate row for each facet (with a compound key of [city, term]) and build a custom comparator that only looks at the first part when generating the token; then we have to do range slices to get all the facets. Again, doable, but not pretty.

-ryan

1. http://twitter.com/goodies/tweetbutton
2. this is not how we actually do this, but it would be a reasonable approach.
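
P.S. To make the buffering concrete, here's a rough sketch in Python. This is not our actual code; BufferedIncrementer and flush_fn are made-up names, and flush_fn stands in for whatever client call actually increments the counter store.

    import threading
    import time
    from collections import defaultdict

    class BufferedIncrementer:
        """Accumulate increments in memory and flush once per interval,
        so many events for the same key become a single increment call."""

        def __init__(self, flush_fn, interval_secs=60):
            # flush_fn(key, delta) is a placeholder for whatever actually
            # talks to the counter store.
            self.flush_fn = flush_fn
            self.interval_secs = interval_secs
            self.buffer = defaultdict(int)
            self.lock = threading.Lock()

        def incr(self, key, delta=1):
            with self.lock:
                self.buffer[key] += delta

        def run_forever(self):
            while True:
                time.sleep(self.interval_secs)
                with self.lock:
                    pending, self.buffer = self.buffer, defaultdict(int)
                for key, delta in pending.items():
                    self.flush_fn(key, delta)

In the log case, each aggregator runs its own buffer, so per key you end up with one increment per aggregator per minute.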
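
And here's a rough sketch of the fixed-width column names for the collapsed model. The 8-byte hour bucket and 16-byte padded term are just example choices for illustration, not what we actually use.

    import struct

    TERM_WIDTH = 16  # fixed width so column names sort by (hour, term)

    def pack_column_name(hour_bucket, term):
        # assumes a nonnegative hour bucket, e.g. hours since the epoch
        term_bytes = term.encode('utf-8')
        if len(term_bytes) > TERM_WIDTH:
            raise ValueError('term too long for fixed-width encoding')
        # big-endian so byte-order comparison matches numeric order
        return struct.pack('>q', hour_bucket) + term_bytes.ljust(TERM_WIDTH, b'\x00')

    def unpack_column_name(name):
        hour_bucket = struct.unpack('>q', name[:8])[0]
        term = name[8:].rstrip(b'\x00').decode('utf-8')
        return hour_bucket, term

Under a bytes comparator those names sort by hour and then term, so one hour's worth of terms comes back as a single contiguous slice.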