>> <rant>
>> TBH while we are using super columns, they somehow feel wrong to me. I
>> would be happier if we could move what we do with super columns into
>> the row key space. But in our case that does not seem to be so easy.
>> </rant>
>>
>
> I'd be quite interested to learn what you are doing with super columns
> that cannot be replicated with composite keys and range queries.
We are storing events as they come in. The timestamp is the key:

  2010-10-01 14:35  event1 someattr1=someval
  2010-10-01 14:35  event1 someattr1=someval
  2010-10-01 14:36  event1 someattr1=someval

We need to access them in time buckets/groups, for example "all events
that happened at 2010-10-01 14:35". Now I see the following options:

1) Store the events in a normal column family and use a range query on
the row key (see sketch 1 below)

  2010-10-01 14:35/UUID: event1 someattr1=someval
  2010-10-01 14:35/UUID: event1 someattr1=someval
  2010-10-01 14:36/UUID: event1 someattr1=someval

Access: range("2010-10-01 14:35".."2010-10-01 14:36") on the row keys

Problem: A range query on the row key requires the OrderedPartitioner,
which leads to hot spots since this is timeline data. The hot spot
would just cycle through the ring.

2) Store the events in a normal column family and update an index (see
sketch 2 below)

  2010-10-01 14:35/UUID1: event1 someattr1=someval
  2010-10-01 14:35/UUID2: event1 someattr1=someval
  2010-10-01 14:36/UUID3: event1 someattr1=someval

  2010-10-01 14:35: [ UUID1, UUID2 ]
  2010-10-01 14:36: [ UUID3 ]

Access: read the index row, then read the rows for all the events in
that bucket

Problem: The index needs to be maintained in an atomic fashion, so a
JSON blob is probably not a great idea. It could instead be implemented
by using the UUIDs as column names, since individual column inserts are
atomic. That could lead to far more column names than one should use,
though. (10000-100000 column names are not a great idea IIUC.)

3) Store per event type and use the time as the column name (see
sketch 3 below)

  event1: {
    2010-10-01 14:35/UUID1: event1 someattr1=someval
    2010-10-01 14:35/UUID2: event1 someattr1=someval
    2010-10-01 14:36/UUID3: event1 someattr1=someval
  }
  event2: {
  }

Access: for every event type, slice("2010-10-01 14:35".."2010-10-01
14:36") on the column names

Problem: Storing per event type is not natural for our application,
and it requires one request per type. It also means a lot of column
names, and Cassandra scales better on the row level.

4) Use a super column (see sketch 4 below)

  2010-10-01 14:35: {
    UUID1: event1 someattr1=someval
    UUID2: event1 someattr1=someval
  }
  2010-10-01 14:36: {
    UUID3: event1 someattr1=someval
  }

Access: just a single get request for the bucket (or page through it
if there are too many results)

Problem: This also has many super column names, but super columns are
a native Cassandra primitive, so one assumes this is optimized, or
will become more optimized. (I am wondering though: do super columns
reside on a single node? I hope not.)

So what would you pick then?

cheers
--
Torsten
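
PS: To make the options concrete, here are rough client-side sketches,
one per option, using the pycassa Python client. They are untested and
written from memory of the pycassa API, and all keyspace/column family
names ('EventLog', 'Events', etc.) are made-up placeholders, so treat
them as pseudocode more than anything else.

Sketch 1 -- option 1, range query over ordered row keys:

    import pycassa

    # keyspace/CF names are placeholders
    pool = pycassa.ConnectionPool('EventLog', ['localhost:9160'])
    events = pycassa.ColumnFamily(pool, 'Events')

    # row key = "<minute bucket>/<uuid>", columns = event attributes;
    # get_range() only scans keys in order under the OrderedPartitioner
    for key, cols in events.get_range(start='2010-10-01 14:35',
                                      finish='2010-10-01 14:36'):
        print key, cols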
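
Sketch 2 -- option 2, UUIDs as column names in an index row (individual
column inserts need no read-modify-write, so concurrent writers do not
clobber each other):

    import time
    import uuid
    import pycassa

    pool = pycassa.ConnectionPool('EventLog', ['localhost:9160'])
    events = pycassa.ColumnFamily(pool, 'Events')
    index = pycassa.ColumnFamily(pool, 'EventsByMinute')

    # write the event row, then record its UUID as a column name in
    # the bucket's index row
    event_id = str(uuid.uuid1())
    bucket = time.strftime('%Y-%m-%d %H:%M')
    events.insert(bucket + '/' + event_id,
                  {'type': 'event1', 'someattr1': 'someval'})
    index.insert(bucket, {event_id: ''})

    # read a bucket: fetch the index row, then multiget the event rows
    ids = index.get(bucket, column_count=10000).keys()
    rows = events.multiget([bucket + '/' + i for i in ids])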
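
Sketch 3 -- option 3, one row per event type, slicing on time-prefixed
column names (still one request per event type):

    import pycassa

    pool = pycassa.ConnectionPool('EventLog', ['localhost:9160'])
    by_type = pycassa.ColumnFamily(pool, 'EventsByType')

    # column names are "<timestamp>/<uuid>", so a column slice selects
    # the time bucket within this event type's row
    cols = by_type.get('event1',
                       column_start='2010-10-01 14:35',
                       column_finish='2010-10-01 14:36',
                       column_count=10000)
    for name, value in cols.items():
        print name, value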
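
Sketch 4 -- option 4, super column family with the bucket as the row
key and the event UUID as the super column name:

    import uuid
    import pycassa

    pool = pycassa.ConnectionPool('EventLog', ['localhost:9160'])
    supers = pycassa.ColumnFamily(pool, 'EventsSuper')  # a super CF

    # write: the nested dict maps super column name (the event UUID)
    # to its subcolumns (the event attributes)
    supers.insert('2010-10-01 14:35',
                  {str(uuid.uuid1()): {'type': 'event1',
                                       'someattr1': 'someval'}})

    # read: a single get returns every event in the bucket, keyed by
    # its UUID
    for event_id, attrs in supers.get('2010-10-01 14:35',
                                      column_count=10000).items():
        print event_id, attrs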