Hello, I am designing a model to store alerts users receive over time. I will probably want to store the last two years of alerts for each user.
The first thought I had was a column family partitioned by user + time bucket, where the time bucket could be something like year + month. For instance:

    partition key:
        user-id = f47ac10b-58cc-4372-a567-0e02b2c3d479
        time-bucket = 201502
    rest of primary key:
        timestamp = column of type timestamp
        alert-id = f47ac10b-58cc-4372-a567-0e02b2c3d480

(see the CQL sketch in the P.S. below)

Question: would this make it easier to delete old data? Supposing I am not using TTL and I want to remove alerts older than 2 years, which would be better: deleting the entire time-bucket for each user-id (through a map/reduce process), or having just user-id as the partition key and deleting, for each user, where X > timestamp > Y? Is it the same for Cassandra internally?

Another question: would data be distributed well enough if I just chose to partition by user-id? I will have some users with a large number of alerts, but on average I expect alerts to be well distributed across user ids. The problem is that I don't feel confident that having a few partitions with a LOT of alerts would not be a problem at read time. What happens at read time when I try to read data from a big partition? Say I want to read alerts for a user where X > timestamp > Y, and the query returns 1 million alerts. As it's all in a single partition, the read will happen on a single node, allocating a lot of memory for this one operation, right? What if the memory needed for this operation is bigger than what fits in the Java heap? Would that be a problem for Cassandra?

Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
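
P.S. For concreteness, here is a minimal CQL sketch of the bucketed model above. The table name (alerts_by_user_bucket), the payload column, and the example dates are my own illustrative assumptions, not a fixed part of the design:

    -- One partition per (user, month); newest alerts first.
    CREATE TABLE alerts_by_user_bucket (
        user_id     uuid,
        time_bucket int,        -- year + month, e.g. 201502
        ts          timestamp,
        alert_id    uuid,
        payload     text,
        PRIMARY KEY ((user_id, time_bucket), ts, alert_id)
    ) WITH CLUSTERING ORDER BY (ts DESC, alert_id DESC);

    -- Reading a time window touches exactly one partition:
    SELECT * FROM alerts_by_user_bucket
    WHERE user_id = f47ac10b-58cc-4372-a567-0e02b2c3d479
      AND time_bucket = 201502
      AND ts >= '2015-02-01' AND ts < '2015-03-01';

    -- Purging an expired month is a single partition delete,
    -- which writes one partition tombstone:
    DELETE FROM alerts_by_user_bucket
    WHERE user_id = f47ac10b-58cc-4372-a567-0e02b2c3d479
      AND time_bucket = 201502;

With user-id alone as the partition key, the equivalent purge would instead be a set of row deletions inside one ever-growing partition, i.e. many tombstones that subsequent reads must skip over, rather than one partition tombstone per expired bucket.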