The data model lgtm.  You may need to balance the size of the time buckets
against the number of alerts to keep partitions from getting too large.  One
month may be a little large; I would aim to keep partitions below 25 MB or so
(you can check with nodetool cfstats) to keep everything happy.  It's OK if
the occasional one goes larger, but something like 1 GB can be bad... it
would still work, just not very efficiently.
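
For concreteness, here's a rough sketch of your model as CQL (python driver
shown; the keyspace name and the payload column are placeholders I made up):

from cassandra.cluster import Cluster

# Sketch of the proposed table.  Partition key = (user_id, time_bucket),
# so one partition holds one user-month of alerts.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('alerts_ks')

session.execute("""
    CREATE TABLE IF NOT EXISTS alerts (
        user_id     uuid,
        time_bucket int,        -- e.g. 201502; shrink toward yyyymmdd if partitions get fat
        ts          timestamp,
        alert_id    uuid,
        payload     text,
        PRIMARY KEY ((user_id, time_bucket), ts, alert_id)
    ) WITH CLUSTERING ORDER BY (ts DESC, alert_id ASC)
""")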

Deleting an entire time bucket at a time seems like a good approach, but
just setting a TTL would be far, far better imho (why not just set it to two
years?).  You may also want to look into the new DateTieredCompactionStrategy,
or LeveledCompactionStrategy; otherwise the obsoleted data will very rarely
go away.
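
If you go the TTL route, something like this should do it (a sketch;
63072000 seconds = 2 years, and iirc DTCS needs 2.0.11 / 2.1.1 or later):

import uuid
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('alerts_ks')

# Expire rows automatically after two years, and use a compaction strategy
# that will actually purge the expired data.
session.execute("""
    ALTER TABLE alerts
    WITH default_time_to_live = 63072000
     AND compaction = {'class': 'DateTieredCompactionStrategy'}
""")

# If you stay with manual deletes instead, dropping a whole time bucket is
# one partition-level tombstone rather than a tombstone per row:
example_user = uuid.UUID('f47ac10b-58cc-4372-a567-0e02b2c3d479')
session.execute(
    "DELETE FROM alerts WHERE user_id = %s AND time_bucket = %s",
    (example_user, 201302),
)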

When reading, just be sure to use paging (the good CQL drivers have it
built in) and don't actually read it all in one massive query.  If you
decrease the size of your time bucket, you may end up having to page the
query across multiple partitions if Y - X > bucket size.
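
e.g. with the python driver, something like (names from the sketch above;
fetch_size is rows per page, not a total limit):

from datetime import datetime
import uuid
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('alerts_ks')

# fetch_size caps rows per round trip; iterating the result transparently
# fetches the next page, so memory stays bounded even for huge partitions.
query = SimpleStatement(
    "SELECT ts, alert_id FROM alerts"
    " WHERE user_id = %s AND time_bucket = %s AND ts >= %s AND ts < %s",
    fetch_size=1000,
)

user = uuid.UUID('f47ac10b-58cc-4372-a567-0e02b2c3d479')
x = datetime(2015, 1, 1)
y = datetime(2015, 3, 1)

# X..Y spans two month buckets here, so run the same query once per bucket.
for bucket in (201501, 201502):
    for row in session.execute(query, (user, bucket, x, y)):
        print(row)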

Chris

On Wed, Feb 4, 2015 at 4:34 AM, Marcelo Elias Del Valle <mvall...@gmail.com>
 wrote:

> Hello,
>
> I am designing a model to store alerts users receive over time. I will
> want to store probably the last two years of alerts for each user.
>
> The first thought I had was having a column family partitioned by user +
> time bucket, where the time bucket could be something like year + month. For
> instance:
>
> *partition key:*
> user-id = f47ac10b-58cc-4372-a567-0e02b2c3d479
> time-bucket = 201502
> *rest of primary key:*
> timestamp = column of type timestamp
> alert id = f47ac10b-58cc-4372-a567-0e02b2c3d480
>
> Question: would this make it easier to delete old data? Supposing I am not
> using TTL and I want to remove alerts older than 2 years, what would be
> better: deleting the entire time bucket for each user-id (through a
> map/reduce process), or having just user-id as the partition key and
> deleting, for each user, where X > timestamp > Y?
>
> Is it the same for Cassandra, internally?
>
> Another question is: would data be distributed enough if I just choose to
> partition by user-id? I will have some users with a large number of alerts,
> but on average I could expect alerts to have a good distribution across
> user ids. The problem is I don't feel confident that having a few partitions
> with A LOT of alerts would not be a problem at read time.
>
> What happens at read time when I try to read data from a big partition?
> Like, I want to read alerts for a user where X > timestamp > Y, but it
> would return 1 million alerts. As it's all in a single partition, this read
> will occur on the same node, thus allocating a lot of memory for this
> single operation, right?
>
> What if the memory needed for this operation is bigger than what fits in
> the Java heap? Would this be a problem for Cassandra?
>
>
> Best regards,
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>
>
