Ryan,

Thanks for the insight.  FWIW, SimpleGeo's use cases are very similar
to the 2nd use case Ryan mentioned.  We want to do rollups by time,
geography, and facet of a customer's record.  The most important
benefit Cassandra brings for us is the ability to handle a large
number of rows (very detailed rollups).  Secondary is the ability to
increment at high volume (the increment buffering that Ryan has
mentioned seems highly valuable).

Dirty Burritos, Inc.
<# burritos sold> in <San Francisco> by <neighborhood>, <meat>, and <size>

Mission
  Total: 1,726
    Meat
      Chicken: 765
      Beef: 620
      Chorizo: 173
      Veggie: 168
SOMA
  Total: 1,526
    Meat
      Chicken: 665
      Beef: 520
      Chorizo: 173
      Veggie: 168
Marina
  Total: 1,326
    Meat
      Chicken: 565
      Beef: 420
      Chorizo: 173
      Veggie: 168

We would roll up by many different time periods (minutes, hours,
days), geographic boundaries (neighborhood, zip, city, state), metrics
(# burritos sold, order total, delivery time), and properties (meat,
male/female, order size).

With a smart schema, I think we can store and update this data in
real-time and make it reasonably query-able, and it will be much
simpler and easier than batch processing.  This kind of reporting
isn't novel or special, but the cost to produce this data becomes
extremely low when you don't have to futz with Hadoop, batch
processing, broken jobs, etc.
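
To make that concrete, here's a rough sketch of the kind of key
fan-out we have in mind.  The column family layout and key formats
are made up, and this assumes a counter CF that supports atomic
increments:

# Sketch only: expand one order event into the set of counter cells
# to increment.  The row key / column naming here is hypothetical.
from datetime import datetime

def rollup_increments(order):
    # order is a dict like:
    # {"city": "San Francisco", "neighborhood": "Mission",
    #  "meat": "Chicken", "size": "large", "ts": datetime.utcnow()}
    ts = order["ts"]
    periods = [ts.strftime("minute:%Y%m%d%H%M"),
               ts.strftime("hour:%Y%m%d%H"),
               ts.strftime("day:%Y%m%d")]
    geos = ["city:" + order["city"], "hood:" + order["neighborhood"]]
    facets = ["total", "meat:" + order["meat"], "size:" + order["size"]]
    for period in periods:
        for geo in geos:
            row_key = geo + "|" + period  # e.g. "hood:Mission|day:20101006"
            for facet in facets:
                yield (row_key, facet, 1)  # one counter increment

# 3 periods * 2 geos * 3 facets = 18 increments for a single order.

The exact key format matters less than the fact that one event fans
out into a small, predictable set of increments, which is exactly
where the buffering would help.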

We have looked at a few ways to store each increment in a new column,
and possibly have some kind of high-level compaction that comes
through and cleans it up, but it just becomes unwieldy at the app
level.
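
For reference, the column-per-increment idea looked roughly like the
sketch below ("cf" is a hypothetical client handle, not a real API),
and the read-side summing plus app-level compaction is where it gets
unwieldy:

# Sketch of the approach we're backing away from: every increment is
# a new column (named by a time-UUID) in a regular column family, and
# reads sum them up.
import uuid

def record_increment(cf, row_key, delta=1):
    # never update in place; just append another column
    cf.insert(row_key, {uuid.uuid1().bytes: str(delta)})

def read_count(cf, row_key):
    # read-side aggregation: pull every increment column and sum it
    return sum(int(v) for v in cf.get(row_key).values())

# The app-level "compaction" would have to periodically collapse the
# pile of increment columns into a single summed column while new
# increments keep landing -- that race is the unwieldy part.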

We plan on messing with #1072 in the very near future, as well as
offering to beta test the increment buffering Ryan has mentioned.

-Ben Standefer


On Wed, Oct 6, 2010 at 6:30 PM, Ryan King <r...@twitter.com> wrote:
> In the spirit of making sure we have clear communication about our
> work, I'd like to outline the use cases Twitter has for distributed
> counters. I expect that many of you using Cassandra currently or in
> the future will have similar use cases.
>
> The first use case is pretty simple: high scale counters. Our Tweet
> button [1] is powered by #1072 counters. We count every mention of
> every url that comes through a public tweet. As you would expect,
> there are a lot of urls and a lot of traffic to this widget (it's on
> many high traffic sites, though it is highly cached).
>
> The second is a bit more complex: time series data. We have built
> infrastructure that can process logs (in real time from scribe) or
> other events and convert them into a series of keys to increment,
> buffer the data for 1 minute, and increment those keys. For logs,
> each aggregator would do its own increment (so per thing you're
> tracking you get an increment for each aggregator), but for events
> it'll be one increment per event. We plan to open source all of this
> soon.
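>
> In sketch form the buffering is nothing fancy (simplified, with
> made-up names -- not the code we'll be open sourcing): coalesce
> deltas per (row key, column) in memory and flush the batch on a
> timer, e.g.:
>
> # Simplified sketch: coalesce increments in memory, flush once a minute.
> import threading
> from collections import defaultdict
>
> class IncrementBuffer(object):
>     def __init__(self, flush_fn, interval=60):
>         self.pending = defaultdict(int)  # (row_key, column) -> delta
>         self.lock = threading.Lock()
>         self.flush_fn = flush_fn         # sends one increment per entry
>         self.interval = interval         # a timer calls flush() this often
>
>     def add(self, row_key, column, delta=1):
>         with self.lock:
>             self.pending[(row_key, column)] += delta
>
>     def flush(self):
>         with self.lock:
>             batch, self.pending = self.pending, defaultdict(int)
>         for (row_key, column), delta in batch.items():
>             self.flush_fn(row_key, column, delta)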
>
> We're hoping to soon start replacing our ganglia clusters with this.
> For the ganglia use-case we end up with a large number of increments
> for every read. For monitoring data, even a reasonably sized fleet
> with a moderate number of metrics can generate a huge amount of data.
> Imagine you have 500 machines (not how many we have) and measure 300
> (a reasonable estimate based on our experience) metrics per machine.
> Suppose you want to measure these things every minute and roll the
> values up every hour, day, month and for all time. Suppose also that
> you were tracking sum, count, min, max, and sum of squares (so that
> you can do standard deviation).  You also want to track these metrics
> across groups like web hosts, databases, datacenters, etc.
>
> These basic assumptions would mean this kind of traffic:
>
> (500 machines + 100 groups) * 300 metrics
>   * 5 time granularities * 4 aggregates
> = 3,600,000 increments/minute
>
> Read traffic, being employee-only, would be negligible compared to this.
>
> One other use case is that for many of the metrics we track, we want
> to track the usage across several facets.
>
> For example [2], to build our local trends feature, you could store
> a time series of terms per city. In this case supercolumns would be a
> natural fit because the set of facets is unknown and open:
>
> Imagine a CF that has data like this:
>
> city0 => hour0 => { term1 => 2, term2 => 1000, term3 => 1 },
>          hour1 => { term5 => 2, term2 => 10 }
> city1 => hour0 => { term12 => 3, term0 => 500, term3 => 1 },
>          hour1 => { term5 => 2, term2 => 10 }
>
> Of course, there are some other ways to model this data: you could
> collapse the subcolumn names into the column names and re-do how you
> slice (you have to slice anyway). You have to have fixed-width terms
> then, though:
>
> city0 => { hour0 + term1 => 2, hour0 + term2 => 1000, hour0 + term3 => 1,
>            hour1 + term5 => 2, hour1 + term2 => 10 }
> city1 => { hour0 + term12 => 3, hour0 + term0 => 500, hour0 + term3 => 1,
>            hour1 + term5 => 2, hour1 + term2 => 10 }
>
> This is doable, but could be rough.
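>
> A toy sketch of the fixed-width packing (widths arbitrary, and the
> hour component would need a fixed width too):
>
> # Pad the term so "hour + term" column names still sort and slice by hour.
> def col_name(hour, term, width=60):
>     return "%s%-*s" % (hour, width, term[:width])
>
> def hour_slice_bounds(hour, width=60):
>     # lowest and highest possible column names within one hour
>     return (hour + " " * width, hour + "~" * width)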
>
> The other option is to have a separate row for each facet (with a
> compound key of [city, term]) and build a custom partitioner that
> only looks at the city part when generating the token; then we have
> to do range slices to get all the facets. Again, doable, but not
> pretty.
>
>
> -ryan
>
>
> 1. http://twitter.com/goodies/tweetbutton
> 2. this is not how we actually do this, but it would be a reasonable approach.
>
