If you want to do range queries on the keys, you can use OPP to do this (example using UTF-8 lexicographic keys, with bursts split across rows according to row-size limits):

Events: {
  "20100601.05.30.003": {
    "20100601.05.30.003": <value>
    "20100601.05.30.007": <value>
    ...
  }
}
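In client code, that key-range scan looks roughly like the sketch below. Treat it as an illustration only: it assumes pycassa, an invented keyspace and connection ('EventLog' on localhost), and a cluster configured with the order-preserving partitioner, none of which comes from this thread.

import pycassa

# Assumed setup: keyspace 'EventLog', column family 'Events', and an
# OPP-configured cluster. get_range() over a key interval returns a
# contiguous slice only when the partitioner preserves key order.
pool = pycassa.ConnectionPool('EventLog', ['localhost:9160'])
events = pycassa.ColumnFamily(pool, 'Events')

# Every burst row whose first-event key falls in the 05:30 minute of
# 2010-06-01; UTF-8 keys compare lexicographically, so this behaves
# like a prefix scan.
for key, columns in events.get_range(start='20100601.05.30',
                                     finish='20100601.05.31'):
    for ts, value in columns.items():
        print("%s %s %s" % (key, ts, value))

(Sketches of the two alternatives discussed below, Ben's group index and slicing by column name within a row, follow the quoted thread.)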
With a future version of Cassandra, you may be able to use the same basic datatype for both key and column name, as keys will be binary like the rest, I believe.

I'm not aware of specific performance improvements when using OPP range queries on keys vs. iterating over known keys. I suspect (hope) that round-tripping to the server should be reduced, which may be significant. Does anybody have decent benchmarks that show the difference?

On Wed, Jun 2, 2010 at 11:53 AM, Ben Browning <ben...@gmail.com> wrote:
> With a traffic pattern like that, you may be better off storing the
> events of each burst (I'll call them groups) in one or more keys and
> then storing these keys in the day key.
>
> EventGroupsPerDay: {
>   "20100601": {
>     123456789: "group123",  // column name is the timestamp the group
>                             // was received, column value is the key
>     123456790: "group124"
>   }
> }
>
> EventGroups: {
>   "group123": {
>     123456789: "value1",
>     123456799: "value2"
>   }
> }
>
> If you think of Cassandra as a toolkit for building scalable indexes,
> the modeling gets a bit easier. In this case, you're building an index
> by day to look up events that come in as groups. So first you'd fetch
> the slice of columns for the day you're interested in to figure out
> which groups to look at, then you'd fetch the events in those groups.
>
> There are plenty of other ways to divide the data among rows, too -
> you could use hour keys instead of day keys, for example.
>
> On Wed, Jun 2, 2010 at 11:57 AM, David Boxenhorn <da...@lookin2.com> wrote:
>> Let's say you're logging events, and you have billions of events. What
>> if the events come in bursts, so within a day there are millions of
>> events, but they all come within microseconds of each other a few times
>> a day? How do you find the events that happened on a particular day if
>> you can't store them all in one row?
>>
>> On Wed, Jun 2, 2010 at 6:45 PM, Jonathan Shook <jsh...@gmail.com> wrote:
>>>
>>> Either OPP by key, or within a row by column name. I'd suggest the
>>> latter. If you have structured data to stick under a column (named by
>>> the timestamp), then you can serialize and deserialize it yourself, or
>>> you can use a supercolumn. It's effectively the same thing. Cassandra
>>> only provides the supercolumn support as a convenience layer as it is
>>> currently implemented. That may change in the future.
>>>
>>> You didn't make clear in your question why a standard column would be
>>> less suitable. I presumed you had layered structure within the
>>> timestamp, hence my response.
>>>
>>> How would you logically partition your dataset according to natural
>>> application boundaries? This will answer most of your question. If
>>> you have a dataset which can't be partitioned into reasonably sized
>>> rows, then you may want to use OPP and key concatenation.
>>>
>>> What do you mean by giant?
>>>
>>> On Wed, Jun 2, 2010 at 10:32 AM, David Boxenhorn <da...@lookin2.com>
>>> wrote:
>>> > How do I handle giant sets of ordered data, e.g. by timestamps,
>>> > which I want to access by range?
>>> >
>>> > I can't put all the data into a supercolumn, because it's loaded
>>> > into memory at once, and it's too much data.
>>> >
>>> > Am I forced to use an order-preserving partitioner? I don't want
>>> > the headache. Is there any other way?
>>> >
>>
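To make Ben's two-level scheme above concrete: reads become a two-step index lookup, first slicing the small day row, then fetching the groups it points at. Another sketch under the same assumptions (pycassa, invented keyspace and connection; the column-family names follow his example):

import pycassa

pool = pycassa.ConnectionPool('EventLog', ['localhost:9160'])
days = pycassa.ColumnFamily(pool, 'EventGroupsPerDay')
groups = pycassa.ColumnFamily(pool, 'EventGroups')

# Step 1: the day row stays small (one column per burst), so one
# slice call fetches it. Column values are the group row keys.
day_index = days.get('20100601', column_count=10000)
group_keys = day_index.values()   # e.g. ['group123', 'group124']

# Step 2: pull the events of every group seen that day.
for group_key, events in groups.multiget(group_keys).items():
    for ts, value in events.items():
        print("%s %s %s" % (group_key, ts, value))

Note that this variant works with the random partitioner, since every lookup is by exact key and nothing ranges over keys.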
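Jonathan's other suggestion, ordering within a single row by column name, also needs no OPP, because columns are always stored sorted by the column family's comparator. A last sketch (the 'EventsByDay' column family and its UTF-8 time-of-day column names are my invention):

import pycassa

pool = pycassa.ConnectionPool('EventLog', ['localhost:9160'])
by_day = pycassa.ColumnFamily(pool, 'EventsByDay')

# Columns within a row are kept sorted by the comparator, so slicing
# by timestamp-shaped names works under the random partitioner too.
# The row itself must stay within a manageable size, which is exactly
# the burst problem discussed above.
slice_0530 = by_day.get('20100601',
                        column_start='05.30.000',
                        column_finish='05.31.000',
                        column_count=1000)
for ts, value in slice_0530.items():
    print("%s %s" % (ts, value))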