Hello, I am designing a model to store alerts users receive over time. I will probably want to store the last two years of alerts for each user.
The first thought I had was a column family partitioned by user + time bucket, where the time bucket could be something like year + month. For instance:

    partition key:
        user-id = f47ac10b-58cc-4372-a567-0e02b2c3d479
        time-bucket = 201502
    rest of primary key:
        timestamp = column of type timestamp
        alert-id = f47ac10b-58cc-4372-a567-0e02b2c3d480

(see the CQL sketch in the P.S. below)

Question: would this make it easier to delete old data? Supposing I am not using TTL and I want to remove alerts older than 2 years, which would be better: deleting the entire time-bucket for each user-id (through a map/reduce process), or having just user-id as the partition key and deleting, for each user, where X > timestamp > Y? Is it the same for Cassandra internally?

Another question: would data be distributed well enough if I just chose to partition by user-id? I will have some users with a large number of alerts, but on average I expect alerts to be well distributed across user ids. The problem is that I don't feel confident that having a few partitions with a LOT of alerts would not be a problem at read time. What happens at read time when I try to read data from a big partition? Say I want to read alerts for a user where X > timestamp > Y, and the query returns 1 million alerts. As it's all in a single partition, the read will happen on a single node, allocating a lot of memory for this one operation, right? What if the memory needed for this operation is bigger than what fits in the Java heap? Would that be a problem for Cassandra?

Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
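
P.S. For concreteness, here is a minimal CQL sketch of the bucketed model above. The table name (alerts_by_user_bucket), the payload column, and the example dates are my own illustrative assumptions, not a fixed part of the design:

    -- One partition per (user, month); newest alerts first.
    CREATE TABLE alerts_by_user_bucket (
        user_id     uuid,
        time_bucket int,        -- year + month, e.g. 201502
        ts          timestamp,
        alert_id    uuid,
        payload     text,
        PRIMARY KEY ((user_id, time_bucket), ts, alert_id)
    ) WITH CLUSTERING ORDER BY (ts DESC, alert_id DESC);

    -- Reading a time window touches exactly one partition:
    SELECT * FROM alerts_by_user_bucket
    WHERE user_id = f47ac10b-58cc-4372-a567-0e02b2c3d479
      AND time_bucket = 201502
      AND ts >= '2015-02-01' AND ts < '2015-03-01';

    -- Purging an expired month is a single partition delete,
    -- which writes one partition tombstone:
    DELETE FROM alerts_by_user_bucket
    WHERE user_id = f47ac10b-58cc-4372-a567-0e02b2c3d479
      AND time_bucket = 201502;

With user-id alone as the partition key, the equivalent purge would instead be a set of row deletions inside one ever-growing partition, i.e. many tombstones that subsequent reads must skip over, rather than one partition tombstone per expired bucket.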