Ryan,

Thanks for the insight. FWIW, SimpleGeo's use cases are very similar to the 2nd use case Ryan mentioned. We want to do rollups by time, geography, and facet of a customer's record. The most important benefit Cassandra brings for us is the ability to handle a large number of rows (very detailed rollups). Second is the ability to increment at high volume (the increment buffering that Ryan has mentioned seems highly valuable).
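To make the rollup idea concrete, here is a rough sketch in plain Python (hypothetical names, not our actual schema, and no client calls): every report cell is just a counter addressed by a composite row key of <customer>:<geo bucket>:<granularity>:<time bucket> plus a facet column, and a small in-process buffer coalesces deltas before flushing one increment per key, in the spirit of the buffering Ryan describes.

    # A rough sketch, not our actual schema: every report cell is a counter
    # addressed by a composite row key (customer, geo bucket, time granularity,
    # time bucket) and a facet column name.
    from collections import defaultdict
    from datetime import datetime

    def rollup_keys(customer, city, neighborhood, meat, when):
        """Yield (row_key, column) pairs to increment for one burrito sold."""
        for granularity, fmt in (("minute", "%Y%m%d%H%M"),
                                 ("hour",   "%Y%m%d%H"),
                                 ("day",    "%Y%m%d")):
            bucket = when.strftime(fmt)
            for geo in (neighborhood, city):
                row_key = "%s:%s:%s:%s" % (customer, geo, granularity, bucket)
                yield row_key, "total"
                yield row_key, "meat:" + meat

    # In the spirit of the increment buffering Ryan describes: coalesce deltas
    # in memory for a minute, then flush one increment per (row, column).
    pending = defaultdict(int)

    def record_sale(customer, city, neighborhood, meat, when):
        for row_key, column in rollup_keys(customer, city, neighborhood, meat, when):
            pending[(row_key, column)] += 1

    def flush(increment):
        """increment(row_key, column, delta) would be the real counter write (e.g. #1072)."""
        for (row_key, column), delta in pending.items():
            increment(row_key, column, delta)
        pending.clear()

    record_sale("dirty_burritos", "san_francisco", "mission", "chicken",
                datetime(2010, 10, 6, 18, 30))
    flush(lambda row_key, column, delta: None)  # plug in the real client here

The kind of report we want to produce out of counters like these looks like this: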
Dirty Burritos, Inc.
<# burritos sold> in <San Francisco> by <neighborhood>, <meat>, and <size>

  Mission
    Total: 1,726
    Meat
      Chicken: 765
      Beef: 620
      Chorizo: 173
      Veggie: 168

  SOMA
    Total: 1,526
    Meat
      Chicken: 665
      Beef: 520
      Chorizo: 173
      Veggie: 168

  Marina
    Total: 1,326
    Meat
      Chicken: 565
      Beef: 420
      Chorizo: 173
      Veggie: 168

We would roll up by many different time periods (minutes, hours, days), geographic boundaries (neighborhood, zip, city, state), metrics (# burritos sold, order total, delivery time), and properties (meat, male/female, order size).

With a smart schema, I think we can store and update this data in real time and make it reasonably query-able, and it will be much simpler and easier than batch processing. This kind of reporting isn't novel or special, but the cost to produce this data becomes extremely low when you don't have to futz with Hadoop, batch processing, broken jobs, etc.

We have looked at a few ways to store each increment in a new column, possibly with some kind of high-level compaction that comes through and cleans it up, but it just becomes unwieldy at the app level. We plan on messing with #1072 in the very near future, as well as offering to beta test the increment buffering Ryan has mentioned.

-Ben Standefer

On Wed, Oct 6, 2010 at 6:30 PM, Ryan King <r...@twitter.com> wrote:
> In the spirit of making sure we have clear communication about our work, I'd like to outline the use cases Twitter has for distributed counters. I expect that many of you using Cassandra currently or in the future will have similar use cases.
>
> The first use case is pretty simple: high-scale counters. Our Tweet button [1] is powered by #1072 counters. We count every mention of every url that comes through a public tweet. As you would expect, there are a lot of urls and a lot of traffic to this widget (it's on many high-traffic sites, though it is highly cached).
>
> The second is a bit more complex: time series data. We have built infrastructure that can process logs (in real time from scribe) or other events and convert them into a series of keys to increment, buffer the data for 1 minute, and increment those keys. For logs, each aggregator would do its own increment (so per thing you're tracking you get an increment for each aggregator), but for events it'll be one increment per event. We plan to open source all of this soon.
>
> We're hoping to soon start replacing our ganglia clusters with this. For the ganglia use case we end up with a large number of increments for every read. For monitoring data, even a reasonably sized fleet with a moderate number of metrics can generate a huge amount of data. Imagine you have 500 machines (not how many we have) and measure 300 (a reasonable estimate based on our experience) metrics per machine. Suppose you want to measure these things every minute and roll the values up every hour, day, month, and for all time. Suppose also that you were tracking sum, count, min, max, and sum of squares (so that you can do standard deviation). You also want to track these metrics across groups like web hosts, databases, datacenters, etc.
>
> These basic assumptions would mean this kind of traffic:
>
>   (500 + 100)  *  300  *  5  *  4  =  3,600,000 increments/minute
>   (machines + groups)  (metrics)  (time granularities)  (aggregates)
>
> Read traffic, being employee-only, would be negligible compared to this.
>
> One other use case is that for many of the metrics we track, we want to track the usage across several facets.
>
> For example [2], to build our local trends feature, you could store a time series of terms per city. In this case supercolumns would be a natural fit because the set of facets is unknown and open:
>
> Imagine a CF that has data like this:
>
> city0 => hour0 => { term1 => 2, term2 => 1000, term3 => 1 },
>          hour1 => { term5 => 2, term2 => 10 }
> city1 => hour0 => { term12 => 3, term0 => 500, term3 => 1 },
>          hour1 => { term5 => 2, term2 => 10 }
>
> Of course, there are some other ways to model this data: you could collapse the subcolumn names into the column names and re-do how you slice (you have to slice anyway). You have to have fixed-width terms then, though:
>
> city0 => { hour0 + term1 => 2, hour0 + term2 => 1000, hour0 + term3 => 1,
>            hour1 + term5 => 2, hour1 + term2 => 10 }
> city1 => { hour0 + term12 => 3, hour0 + term0 => 500, hour0 + term3 => 1,
>            hour1 + term5 => 2, hour1 + term2 => 10 }
>
> This is doable, but could be rough.
>
> The other option is to have a separate row for each facet (with a compound key of [city, term]) and build a custom comparator that only looks at the first part for generating the token; then we have to do range slices to get all the facets. Again, doable, but not pretty.
>
> -ryan
>
> 1. http://twitter.com/goodies/tweetbutton
> 2. This is not how we actually do this, but it would be a reasonable approach.
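For what it's worth, the collapsed-column layout Ryan sketches above seems workable to us. Here is a rough illustration of one possible encoding (hypothetical, plain Python, not Ryan's actual implementation): pack a fixed-width hour bucket in front of the term so that one hour's columns sort together under a byte comparator and can be pulled back with a single slice.

    import struct

    def column_name(hour, term):
        """Fixed-width 4-byte big-endian hour bucket, then the raw term bytes."""
        return struct.pack(">I", hour) + term.encode("utf-8")

    def hour_slice(hour):
        """(start, finish) column names bounding every term column in one hour."""
        prefix = struct.pack(">I", hour)
        return prefix, prefix + b"\xff"   # 0xff never appears in UTF-8, so it sorts after any term

    # Row key is the city; column names are hour-prefixed terms; values are counts.
    row = {
        column_name(0, "term1"): 2,
        column_name(0, "term2"): 1000,
        column_name(0, "term3"): 1,
        column_name(1, "term5"): 2,
        column_name(1, "term2"): 10,
    }

    start, finish = hour_slice(0)
    hour0 = {name: count for name, count in row.items() if start <= name <= finish}
    # hour0 now holds only the hour-0 term counts, as a column slice would.

In this particular sketch only the leading hour component needs a fixed width, since the slice is bounded by the hour prefix.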