Insert "if you want to use long values for keys and column names" above paragraph 2. I forgot that part.
On Wed, Jun 2, 2010 at 1:29 PM, Jonathan Shook <jsh...@gmail.com> wrote:
> If you want to do range queries on the keys, you can use OPP to do this
> (example using UTF-8 lexicographic keys, with bursts split across rows
> according to row size limits):
>
> Events: {
>     "20100601.05.30.003": {
>         "20100601.05.30.003": <value>
>         "20100601.05.30.007": <value>
>         ...
>     }
> }
>
> With a future version of Cassandra, you may be able to use the same
> basic datatype for both key and column name, as keys will be binary
> like the rest, I believe.
>
> I'm not aware of specific performance improvements when using OPP
> range queries on keys vs iterating over known keys. I suspect (hope)
> that round-tripping to the server should be reduced, which may be
> significant. Does anybody have decent benchmarks that show the
> difference?
>
>
> On Wed, Jun 2, 2010 at 11:53 AM, Ben Browning <ben...@gmail.com> wrote:
>> With a traffic pattern like that, you may be better off storing the
>> events of each burst (I'll call them groups) in one or more keys and
>> then storing these keys in the day key.
>>
>> EventGroupsPerDay: {
>>     "20100601": {
>>         123456789: "group123", // column name is the timestamp the group was received, column value is the group key
>>         123456790: "group124"
>>     }
>> }
>>
>> EventGroups: {
>>     "group123": {
>>         123456789: "value1",
>>         123456799: "value2"
>>     }
>> }
>>
>> If you think of Cassandra as a toolkit for building scalable indexes,
>> the modeling gets a bit easier. In this case, you're building an index
>> by day to look up events that come in as groups. So, first you'd fetch
>> the slice of columns for the day you're interested in to figure out
>> which groups to look at, then you'd fetch the events in those groups.
>>
>> There are plenty of alternate ways to divide up the data among rows,
>> too - you could use hour keys instead of day keys, for example.
>>
>> On Wed, Jun 2, 2010 at 11:57 AM, David Boxenhorn <da...@lookin2.com> wrote:
>>> Let's say you're logging events, and you have billions of events. What if
>>> the events come in bursts, so within a day there are millions of events,
>>> but they all come within microseconds of each other a few times a day?
>>> How do you find the events that happened on a particular day if you
>>> can't store them all in one row?
>>>
>>> On Wed, Jun 2, 2010 at 6:45 PM, Jonathan Shook <jsh...@gmail.com> wrote:
>>>>
>>>> Either OPP by key, or within a row by column name. I'd suggest the latter.
>>>> If you have structured data to stick under a column (named by the
>>>> timestamp), then you can serialize and deserialize it yourself, or you
>>>> can use a supercolumn. It's effectively the same thing. Cassandra only
>>>> provides the supercolumn support as a convenience layer as it is
>>>> currently implemented. That may change in the future.
>>>>
>>>> You didn't make clear in your question why a standard column would be
>>>> less suitable. I presumed you had layered structure within the
>>>> timestamp, hence my response.
>>>>
>>>> How would you logically partition your dataset according to natural
>>>> application boundaries? This will answer most of your question.
>>>> If you have a dataset which can't be partitioned into reasonably
>>>> sized rows, then you may want to use OPP and key concatenation.
>>>>
>>>> What do you mean by giant?
>>>>
>>>> On Wed, Jun 2, 2010 at 10:32 AM, David Boxenhorn <da...@lookin2.com>
>>>> wrote:
>>>> > How do I handle giant sets of ordered data, e.g. by timestamps, which I
>>>> > want to access by range?
>>>> >
>>>> > I can't put all the data into a supercolumn, because it's loaded into
>>>> > memory at once, and it's too much data.
>>>> >
>>>> > Am I forced to use an order-preserving partitioner? I don't want the
>>>> > headache. Is there any other way?
>>>> >
>>>
>>>
>>
>
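P.S. Here is the sketch of Ben's two-step lookup mentioned above (also untested; the CF names come straight from his example, the "Logs" keyspace is the same placeholder as before, and it assumes the column families use a LongType comparator so pycassa can use integer timestamps as column names):

# Untested sketch: day-index lookup followed by fetching the group rows.
# Assumes pycassa, and that EventGroupsPerDay / EventGroups are defined
# with a LongType comparator so integer timestamps work as column names.
import pycassa

pool = pycassa.ConnectionPool('Logs', ['localhost:9160'])
groups_per_day = pycassa.ColumnFamily(pool, 'EventGroupsPerDay')
event_groups = pycassa.ColumnFamily(pool, 'EventGroups')

# Step 1: slice the day row; column values are the keys of the group rows.
day_index = groups_per_day.get('20100601', column_count=1000)
group_keys = day_index.values()

# Step 2: fetch all of those group rows in one round trip.
for group_key, columns in event_groups.multiget(group_keys).items():
    for event_ts, event_value in columns.items():
        pass  # each (event_ts, event_value) pair is one event from the burst

If a day has more than 1000 groups you'd have to page with repeated get() calls, moving column_start past the last column seen on each pass.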