Insert "if you want to use long values for keys and column names" above paragraph 2. I forgot that part.
On Wed, Jun 2, 2010 at 1:29 PM, Jonathan Shook <jsh...@gmail.com> wrote:
> If you want to do range queries on the keys, you can use OPP to do this
> (example using UTF-8 lexicographic keys, with bursts split across rows
> according to row size limits):
>
> Events: {
>     "20100601.05.30.003": {
>         "20100601.05.30.003": <value>
>         "20100601.05.30.007": <value>
>         ...
>     }
> }
>
> With a future version of Cassandra, you may be able to use the same
> basic datatype for both key and column name, as keys will be binary
> like the rest, I believe.
>
> I'm not aware of specific performance improvements when using OPP
> range queries on keys vs iterating over known keys. I suspect (hope)
> that round-tripping to the server should be reduced, which may be
> significant. Does anybody have decent benchmarks that show the
> difference?
>
>
> On Wed, Jun 2, 2010 at 11:53 AM, Ben Browning <ben...@gmail.com> wrote:
>> With a traffic pattern like that, you may be better off storing the
>> events of each burst (I'll call them groups) in one or more keys and
>> then storing these keys in the day key.
>>
>> EventGroupsPerDay: {
>>     "20100601": {
>>         123456789: "group123", // column name is the timestamp the group was received, column value is the group key
>>         123456790: "group124"
>>     }
>> }
>>
>> EventGroups: {
>>     "group123": {
>>         123456789: "value1",
>>         123456799: "value2"
>>     }
>> }
>>
>> If you think of Cassandra as a toolkit for building scalable indexes,
>> the modeling gets a bit easier. In this case, you're building an index
>> by day to look up events that come in as groups. So, first you'd fetch
>> the slice of columns for the day you're interested in to figure out
>> which groups to look at, then you'd fetch the events in those groups.
>>
>> There are plenty of alternate ways to divide up the data among rows,
>> too - you could use hour keys instead of day keys, for example.
>>
>> On Wed, Jun 2, 2010 at 11:57 AM, David Boxenhorn <da...@lookin2.com> wrote:
>>> Let's say you're logging events, and you have billions of events. What if
>>> the events come in bursts, so within a day there are millions of events,
>>> but they all come within microseconds of each other a few times a day?
>>> How do you find the events that happened on a particular day if you
>>> can't store them all in one row?
>>>
>>> On Wed, Jun 2, 2010 at 6:45 PM, Jonathan Shook <jsh...@gmail.com> wrote:
>>>>
>>>> Either OPP by key, or within a row by column name. I'd suggest the latter.
>>>> If you have structured data to stick under a column (named by the
>>>> timestamp), then you can serialize and deserialize it yourself, or you
>>>> can use a supercolumn. It's effectively the same thing. Cassandra only
>>>> provides the supercolumn support as a convenience layer as it is
>>>> currently implemented. That may change in the future.
>>>>
>>>> You didn't make clear in your question why a standard column would be
>>>> less suitable. I presumed you had layered structure within the
>>>> timestamp, hence my response.
>>>>
>>>> How would you logically partition your dataset according to natural
>>>> application boundaries? This will answer most of your question.
>>>> If you have a dataset which can't be partitioned into reasonably
>>>> sized rows, then you may want to use OPP and key concatenation.
>>>>
>>>> What do you mean by giant?
>>>>
>>>> On Wed, Jun 2, 2010 at 10:32 AM, David Boxenhorn <da...@lookin2.com>
>>>> wrote:
>>>> > How do I handle giant sets of ordered data, e.g. by timestamps, which I
>>>> > want to access by range?
>>>> >
>>>> > I can't put all the data into a supercolumn, because it's loaded into
>>>> > memory at once, and it's too much data.
>>>> >
>>>> > Am I forced to use an order-preserving partitioner? I don't want the
>>>> > headache. Is there any other way?
>>>> >
>>>
>>>
>>
>
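P.S. Here is the sketch of Ben's two-step lookup mentioned above (also untested; the CF names come straight from his example, the "Logs" keyspace is the same placeholder as before, and it assumes the column families use a LongType comparator so pycassa can use integer timestamps as column names):

# Untested sketch: day-index lookup followed by fetching the group rows.
# Assumes pycassa, and that EventGroupsPerDay / EventGroups are defined
# with a LongType comparator so integer timestamps work as column names.
import pycassa

pool = pycassa.ConnectionPool('Logs', ['localhost:9160'])
groups_per_day = pycassa.ColumnFamily(pool, 'EventGroupsPerDay')
event_groups = pycassa.ColumnFamily(pool, 'EventGroups')

# Step 1: slice the day row; column values are the keys of the group rows.
day_index = groups_per_day.get('20100601', column_count=1000)
group_keys = day_index.values()

# Step 2: fetch all of those group rows in one round trip.
for group_key, columns in event_groups.multiget(group_keys).items():
    for event_ts, event_value in columns.items():
        pass  # each (event_ts, event_value) pair is one event from the burst

If a day has more than 1000 groups you'd have to page with repeated get() calls, moving column_start past the last column seen on each pass.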