With a traffic pattern like that, you may be better off storing the events of each burst (I'll call them groups) in one or more keys and then storing those keys in the day key.
EventGroupsPerDay: {
  "20100601": {
    123456789: "group123",   // column name is the timestamp the group was received, column value is the group key
    123456790: "group124"
  }
}

EventGroups: {
  "group123": {
    123456789: "value1",
    123456799: "value2"
  }
}

If you think of Cassandra as a toolkit for building scalable indexes, the modeling gets a bit easier. In this case, you're building an index by day to look up events that come in as groups. So first you'd fetch the slice of columns for the day you're interested in to figure out which groups to look at, then you'd fetch the events in those groups. (There's a rough pycassa sketch of that read/write path at the bottom of this mail.)

There are plenty of alternate ways to divide the data among rows as well - you could use hour keys instead of day keys, for example.

On Wed, Jun 2, 2010 at 11:57 AM, David Boxenhorn <da...@lookin2.com> wrote:
> Let's say you're logging events, and you have billions of events. What if
> the events come in bursts, so within a day there are millions of events, but
> they all come within microseconds of each other a few times a day? How do
> you find the events that happened on a particular day if you can't store
> them all in one row?
>
> On Wed, Jun 2, 2010 at 6:45 PM, Jonathan Shook <jsh...@gmail.com> wrote:
>>
>> Either OPP by key, or within a row by column name. I'd suggest the latter.
>> If you have structured data to stick under a column (named by the
>> timestamp), then you can serialize and unserialize it yourself, or you
>> can use a supercolumn. It's effectively the same thing. Cassandra
>> only provides the super column support as a convenience layer as it is
>> currently implemented. That may change in the future.
>>
>> You didn't make clear in your question why a standard column would be
>> less suitable. I presumed you had layered structure within the
>> timestamp, hence my response.
>>
>> How would you logically partition your dataset according to natural
>> application boundaries? This will answer most of your question.
>> If you have a dataset which can't be partitioned into a reasonable
>> size row, then you may want to use OPP and key concatenation.
>>
>> What do you mean by giant?
>>
>> On Wed, Jun 2, 2010 at 10:32 AM, David Boxenhorn <da...@lookin2.com>
>> wrote:
>> > How do I handle giant sets of ordered data, e.g. by timestamps, which I
>> > want to access by range?
>> >
>> > I can't put all the data into a supercolumn, because it's loaded into
>> > memory at once, and it's too much data.
>> >
>> > Am I forced to use an order-preserving partitioner? I don't want the
>> > headache. Is there any other way?
>> >
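
Not from the thread above, but for concreteness, here is a minimal sketch of that write/read path using pycassa. The keyspace name, the column family names, and the group-key scheme are all placeholders, and it assumes both column families use a LongType comparator so the timestamp column names sort correctly.

# Minimal sketch (placeholders throughout): two-level index from the mail above.
# Assumes a 'Logging' keyspace with EventGroupsPerDay and EventGroups CFs,
# both with LongType comparators so timestamp column names sort.
import time
import uuid

import pycassa

pool = pycassa.ConnectionPool('Logging', ['localhost:9160'])
groups_per_day = pycassa.ColumnFamily(pool, 'EventGroupsPerDay')
event_groups = pycassa.ColumnFamily(pool, 'EventGroups')

def store_burst(day_key, events):
    """Store one burst. `events` is a dict of {timestamp_in_usec: value}."""
    group_key = 'group-' + uuid.uuid1().hex
    event_groups.insert(group_key, events)
    # Index the group under the day it arrived; column name is arrival time.
    received_at = long(time.time() * 1e6)
    groups_per_day.insert(day_key, {received_at: group_key})

def events_for_day(day_key):
    """Read path: day row -> group keys -> events in each group."""
    events = {}
    day_row = groups_per_day.get(day_key, column_count=10000)
    for received_at, group_key in day_row.items():
        events.update(event_groups.get(group_key, column_count=10000))
    return events

The point of splitting it this way is that no single row ever has to hold a whole day's events: each read touches one small index row for the day plus only the group rows it points to.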