The typical recommendation for maximum partition size is on the order of
100 MB and/or 100,000 rows. That's not a hard limit, but you may be setting
yourself up for issues as you approach or exceed those numbers.

If you need to reduce partition size, the typical way to do this is by
"bucketing," that is, adding a synthetic column to the partition key to
split the data into separate buckets. In your example above, I assume
that your current primary key is (user_id, timestamp), where user_id is the
partition key and timestamp is the clustering key. Say that you want to
store at most a year's worth of data in a partition. You would make the
primary key ((user_id, year), timestamp). The partition key is now (user_id,
year), where year is the year part of the timestamp. Now if you want to
query the data for the last month, you would do:

select * from data where user_id = 'foo' and year = 2017
  and timestamp >= '<1 month ago>' and timestamp <= '<today>'


If you wanted to get the data for the last six months, you'd do something
like:

select * from data where user_id = 'foo' and year in (2016, 2017)
  and timestamp >= '<6 months ago>' and timestamp <= '<today>'

(Notice that because the query spans two years, you have to include both
years in the select criteria so that C* knows which partitions to query.)
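
On the write side, the client just computes the year bucket from the
timestamp it's about to insert, e.g. (made-up values, assuming the sketch
table above):

insert into data (user_id, year, timestamp, foo, bar)
values ('foo', 2017, '2017-03-27 20:45:00+0000', 1, 2);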


You can make the buckets smaller (e.g. weeks, days, or hours instead of
years), but querying multiple buckets is less efficient, so it's worth
making your buckets as large as you can while keeping them under the size
guidance above.
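
For example, if you bucketed by month instead, say with a text bucket
column like '2017-03' in place of year (so the partition key becomes
(user_id, bucket)), the same six-month query has to name every bucket it
touches:

select * from data where user_id = 'foo'
  and bucket in ('2016-10', '2016-11', '2016-12', '2017-01', '2017-02', '2017-03')
  and timestamp >= '<6 months ago>' and timestamp <= '<today>'

That can mean six or seven partitions per query, versus at most two with
yearly buckets.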

Hope this helps!

Jim




On Mon, Mar 27, 2017 at 8:47 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> I have a use case where the data for individual users is being tracked,
> and every 15 minutes or so, the data for the past 15 minutes is inserted
> into the table.
>
> The table schema looks like:
> user id, timestamp, foo, bar, etc.
>
> Where foo, bar, etc are the items being tracked, and their values over the
> past 15 minutes.
>
> I initially planned to use the user id as the primary key of the table.
> But, I realized that this may cause really wide rows ( tracking for 24
> hours means 96 records inserted (1 for each 15 min window), over 1 year
> this means 36k records per user, over 2 years, 72k, etc).
>
> I know the  limit of wide rows is billions of records, but I've heard that
> the practical limit is much lower.
>
> So I considered using a composite primary key: (user, timestamp)
>
> If I'm correct, the above should create a new row for each user &
> timestamp logged.
>
> However, will i still be able to do range queries on the timestamp, to e.g
> return the data for the last week?
>
> E.g select * from data where user_id = 'foo' and timestamp >= '<1 month
> ago>' and timestamp <= '<today>' ?
>
>
