http://www.youtube.com/watch?v=eaCCkfjPm0o
3:30 song begins; 4:00 starfish loves you and Cassandra loves you!
On Thu, May 6, 2010 at 11:03 AM, Denis Haskin <de...@haskinferguson.net> wrote:
> i can haz hints pleez?
>
> On Wed, May 5, 2010 at 9:28 PM, philip andrew <philip14...@gmail.com> wrote:
> > Starfish loves you.
> >
> > On Wed, May 5, 2010 at 1:16 PM, David Strauss <da...@fourkitchens.com> wrote:
> >>
> >> On 2010-05-05 04:50, Denis Haskin wrote:
> >> > I've been reading everything I can get my hands on about Cassandra and
> >> > it sounds like a possibly very good framework for our data needs; I'm
> >> > about to take the plunge and do some prototyping, but I thought I'd
> >> > see if I can get a reality check here on whether it makes sense.
> >> >
> >> > Our schema should be fairly simple; we may only keep our original data
> >> > in Cassandra, and the rollups and analyzed results in a relational db
> >> > (although this is still open for discussion).
> >>
> >> This is what we do on some projects. This is a particularly nice
> >> strategy if the raw : aggregated ratio is really high or the raw data is
> >> bursty or highly volatile.
> >>
> >> Consider Hadoop integration for your aggregation needs.
> >>
> >> > We have fairly small records: 120-150 bytes, in maybe 18 columns.
> >> > Data is additive only; we would rarely, if ever, be deleting data.
> >>
> >> Cassandra loves you.
> >>
> >> > Our core data set will accumulate at somewhere between 14 and 27
> >> > million rows per day; we'll be starting with about a year and a half
> >> > of data (7.5 - 15 billion rows) and eventually would like to keep 5
> >> > years online (25 to 50 billion rows). (So that's maybe 1.3TB or so
> >> > per year, data only. Not sure about the overhead yet.)
> >> >
> >> > Ideally we'd like to also have a cluster with our complete data set,
> >> > which is maybe 38 billion rows per year (we could live with less than
> >> > 5 years of that).
> >> >
> >> > I haven't really thought through what the schema's going to be; our
> >> > primary key is an entity's ID plus a timestamp. But there are 2 or 3
> >> > other retrieval paths we'll need to support as well.
> >>
> >> Generally, you do multiple retrieval paths through denormalization in
> >> Cassandra.
> >>
> >> > Thoughts? Pitfalls? Gotchas? Are we completely whacked?
> >>
> >> Does the random partitioner support what you need?
> >>
> >> --
> >> David Strauss
> >>    | da...@fourkitchens.com
> >> Four Kitchens
> >>    | http://fourkitchens.com
> >>    | +1 512 454 6659 [office]
> >>    | +1 512 870 8453 [direct]
> >
>
> --
> dwh
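
[Editor's note] A rough sketch of the denormalization David describes above, in plain Python with dicts standing in for column families (no Cassandra client involved). The column-family names (Readings, ReadingsByRegion) and the fields region/payload are invented for illustration; they are not part of Denis's actual schema. The idea is simply: key the canonical row by entity ID, use the timestamp as the column name so a time range is a single row slice, and write the same data a second time under each alternate retrieval key instead of joining at read time.

    # Untested sketch: dicts model column families as {row_key: {column: value}}.
    from collections import defaultdict

    readings = defaultdict(dict)             # canonical CF: row key = entity ID
    readings_by_region = defaultdict(dict)   # denormalized CF: row key = region (hypothetical path)

    def store_reading(entity_id, timestamp, region, payload):
        """Write the reading once per retrieval path; reads never join."""
        readings[entity_id][timestamp] = payload
        # Duplicate under the alternate key so "region X in time range" is one row slice.
        readings_by_region[region][(timestamp, entity_id)] = payload

    def readings_for_entity(entity_id, start, end):
        """Emulate a column slice: all readings for one entity in [start, end)."""
        row = readings[entity_id]
        return {ts: v for ts, v in sorted(row.items()) if start <= ts < end}

    if __name__ == "__main__":
        store_reading("vehicle-42", 1273100000, "us-east", "sample 1 (120-150 bytes)")
        store_reading("vehicle-42", 1273100060, "us-east", "sample 2")
        print(readings_for_entity("vehicle-42", 1273100000, 1273200000))

In a real deployment each dict would be a Cassandra column family with a timestamp-ordered comparator (or TimeUUID columns), and the two writes in store_reading would go through whatever client is in use (Thrift directly, or a wrapper such as pycassa); the double write is the cost of getting each retrieval path as a cheap contiguous read.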