Given that your current schema has ~18 small columns per row, adding a level by using supercolumns may make sense for you, because the limitation of deserializing a whole supercolumn at once isn't going to be a problem at that size.

20K supercolumns per row with ~18 small subcolumns each is completely reasonable. The (super)columns within each row will be ordered, and you can use the much-easier-to-administer RandomPartitioner.
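To make that concrete, here is a minimal sketch of that layout using the pycassa client. The keyspace ("Metrics"), column family ("TimeSeries"), row key, and field names are all placeholder assumptions, and it presumes the CF is declared as a super column family with a LongType supercolumn comparator:

    # Hedged sketch only (pycassa client, Cassandra 0.6/0.7-era data model).
    # Assumes a keyspace "Metrics" with a super column family "TimeSeries"
    # declared as ColumnType=Super, CompareWith=LongType, so supercolumn
    # names (millisecond timestamps here) sort numerically within each row.
    import time
    import pycassa

    pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
    ts = pycassa.ColumnFamily(pool, 'TimeSeries')

    entity_id = 'entity-12345'            # row key: one row per entity
    observed_at = int(time.time() * 1000)  # supercolumn name: observation time

    # One supercolumn per observation; its ~18 small subcolumns hold the fields.
    ts.insert(entity_id, {
        observed_at: {
            'field_a': '42.3601',
            'field_b': '-71.0589',
            'status': 'ok',
            # ... remaining small fields ...
        }
    })

    # Because (super)columns are ordered within the row, a time-range read is
    # just a column slice inside one row. RandomPartitioner only randomizes
    # placement of whole rows, so no OrderPreservingPartitioner is needed for
    # this access path.
    start = observed_at - 3600 * 1000     # last hour, in ms
    recent = ts.get(entity_id,
                    column_start=start,
                    column_finish=observed_at,
                    column_count=1000)
    for when, fields in recent.items():
        print when, fields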
On 2010-05-05 11:22, Denis Haskin wrote:
> David -- thanks for the thoughts.
>
> In re: your question
>> Does the random partitioner support what you need?
>
> I guess my answer is "I'm not sure yet", but also my initial thought
> was that we'd use the (or a) OrderPreservingPartitioner so that we
> could use range scans and that rows for a given entity would be
> co-located (if I'm understanding Cassandra's storage architecture
> properly). But that may be a naive approach.
>
> In our core data set, we have maybe 20,000 entities about which we are
> storing time-series data (and it's fairly well distributed across these
> entities). Occurs to me it's also possible to store an entity per row,
> with the time-series data as (or in?) super columns (and maybe it
> would make sense to break those out into column families by date range).
> I'd have to think through a little more what that might mean for our
> secondary indexing needs.
>
> Thanks,
>
> dwh
>
>
>
> On Wed, May 5, 2010 at 1:16 AM, David Strauss <da...@fourkitchens.com> wrote:
>> On 2010-05-05 04:50, Denis Haskin wrote:
>>> I've been reading everything I can get my hands on about Cassandra and
>>> it sounds like a possibly very good framework for our data needs; I'm
>>> about to take the plunge and do some prototyping, but I thought I'd
>>> see if I can get a reality check here on whether it makes sense.
>>>
>>> Our schema should be fairly simple; we may only keep our original data
>>> in Cassandra, and the rollups and analyzed results in a relational db
>>> (although this is still open for discussion).
>>
>> This is what we do on some projects. This is a particularly nice
>> strategy if the raw : aggregated ratio is really high or the raw data is
>> bursty or highly volatile.
>>
>> Consider Hadoop integration for your aggregation needs.
>>
>>> We have fairly small records: 120-150 bytes, in maybe 18 columns.
>>> Data is additive only; we would rarely, if ever, be deleting data.
>>
>> Cassandra loves you.
>>
>>> Our core data set will accumulate at somewhere between 14 and 27
>>> million rows per day; we'll be starting with about a year and a half
>>> of data (7.5 - 15 billion rows) and eventually would like to keep 5
>>> years online (25 to 50 billion rows). (So that's maybe 1.3TB or so
>>> per year, data only. Not sure about the overhead yet.)
>>>
>>> Ideally we'd like to also have a cluster with our complete data set,
>>> which is maybe 38 billion rows per year (we could live with less than
>>> 5 years of that).
>>>
>>> I haven't really thought through what the schema's going to be; our
>>> primary key is an entity's ID plus a timestamp. But there's 2 or 3
>>> other retrieval paths we'll need to support as well.
>>
>> Generally, you do multiple retrieval paths through denormalization in
>> Cassandra.
>>
>>> Thoughts? Pitfalls? Gotchas? Are we completely whacked?
>>
>> Does the random partitioner support what you need?
>>
>> --
>> David Strauss
>>    | da...@fourkitchens.com
>> Four Kitchens
>>    | http://fourkitchens.com
>>    | +1 512 454 6659 [office]
>>    | +1 512 870 8453 [direct]
>>
>>

--
David Strauss
   | da...@fourkitchens.com
   | +1 512 577 5827 [mobile]
Four Kitchens
   | http://fourkitchens.com
   | +1 512 454 6659 [office]
   | +1 512 870 8453 [direct]