1. Don't leave your partitions unbounded. It's tempting to just use (device_id,
timestamp), but sooner or later you will run into problems as each partition
keeps growing over time. You can keep partitions bounded by using (device_id,
bucket) as the partition key and timestamp as the clustering column. Use
hour, day, month or even year as the bucket, like Jack mentioned, depending on
the size of the data.

2. As to your specific query, for a given partition and a time range, C*
doesn't need to load the whole partition then filter. It only retrieves the
slice within the time range from disk because the data is clustered by
timestamp.
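
To make both points concrete, here is a rough Python sketch (not driver
code, and not tied to any Cassandra API). It assumes a day-sized bucket in
UTC; the function names day_bucket, buckets_for_range and slice_partition
are made up for illustration. The first two show how a bucket could be
derived and which buckets a time-range query would have to touch. The third
mimics, on a plain sorted list, why a time-range read on clustered data only
touches the requested slice.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta, timezone


def day_bucket(ts):
    """Return the day bucket (YYYY-MM-DD, UTC) a timestamp falls into.

    Assumption: one bucket per day; pick hour/month/year to fit your data.
    """
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def buckets_for_range(start, end):
    """List every day bucket touched by the inclusive range [start, end].

    A reader would query one (device_id, bucket) partition per entry.
    """
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets


def slice_partition(sorted_timestamps, start, end):
    """Return the timestamps within [start, end] from one partition.

    The input must already be sorted, as rows are within a partition
    clustered by timestamp. Binary search finds the two boundaries, so
    rows outside the range are never scanned.
    """
    lo = bisect_left(sorted_timestamps, start)
    hi = bisect_right(sorted_timestamps, end)
    return sorted_timestamps[lo:hi]


if __name__ == "__main__":
    start = datetime(2015, 11, 2, 8, 0, tzinfo=timezone.utc)
    end = datetime(2015, 11, 4, 9, 0, tzinfo=timezone.utc)
    print(day_bucket(start))
    print(buckets_for_range(start, end))
```

With day buckets, a query for one device over the last week reads seven or
eight small partitions instead of one ever-growing partition.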

On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> The general rule in Cassandra data modeling is to look at all of your
> queries first and then to declare a table for each query, even if that
> means storing multiple copies of the data. So, create a second table with
> bucketed time as the partition key (hour, 15 minutes, or whatever time
> interval makes sense to give 1 to 10 megabytes per partition) and time and
> device as the clustering keys.
>
> Or, consider DSE Search, and then you can do whatever ad hoc queries you
> want using Solr. Or Stratio or TupleJump Stargate for an open source Lucene
> plugin.
>
> -- Jack Krupansky
>
> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
> guilla...@databerries.com> wrote:
>
>> Hello,
>>
>> We are currently storing geolocation events (about 1 per 5 minutes) for
>> each device we track. We currently have 2 TB of data. I would like to store
>> the device_id, the timestamp of the event, latitude and longitude. I thought
>> about using the device_id as the partition key and timestamp as the
>> clustering column. It is great as events are naturally grouped by device
>> (very useful for our Spark jobs). However, if I would like to retrieve all
>> events of all devices of the last week I understood that Cassandra will
>> need to load all data and filter, which does not seem to be clean in the
>> long term.
>>
>> How should I create my model?
>>
>> Best Regards
>>
>
>
