For the first table: (device_id, timestamp), should I add a bucket even if
I know I might have millions of events per device but never billions?

On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> Cassandra is good at two kinds of queries: 1) access a specific row by a
> specific key, and 2) Access a slice or consecutive sequence of rows within
> a given partition.
>
> It is recommended to avoid ALLOW FILTERING. If it happens to work well for
> you, great, go for it, but if it doesn't then simply don't do it. Best to
> redesign your data model to play to Cassandra's strengths.
>
> If you bucket the time-based table, do a separate query for each time
> bucket.
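A sketch of what "a separate query for each time bucket" could look like, assuming a hypothetical table bucketed by day (the table name, column names, and day granularity are illustrative, not from the thread):

```sql
-- Assumes a hypothetical table with PRIMARY KEY ((device_id, bucket), ts),
-- bucketed by day. To span two days, issue one query per bucket and merge
-- the result sets client-side (or in Spark):
SELECT * FROM events_by_device
 WHERE device_id = 123e4567-e89b-12d3-a456-426655440000
   AND bucket = '2015-11-08';

SELECT * FROM events_by_device
 WHERE device_id = 123e4567-e89b-12d3-a456-426655440000
   AND bucket = '2015-11-09';
```

Each query hits exactly one partition, so no ALLOW FILTERING is involved.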
>
> -- Jack Krupansky
>
> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
> guilla...@databerries.com> wrote:
>
>> Kai, Jack,
>>
>> On 1., should the bucket be a STRING with a date format, or do I have a
>> better option? For (device_id, bucket, timestamp), did you mean
>> ((device_id, bucket), timestamp)?
>>
>> On 2., what are the risks of timeout? I currently have this warning:
>> "Cannot execute this query as it might involve data filtering and thus may
>> have unpredictable performance. If you want to execute this query despite
>> the performance unpredictability, use ALLOW FILTERING".
>>
>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>>
>>> 1. Don't make your partitions unbounded. It's tempting to just use
>>> (device_id, timestamp), but sooner or later you will have problems as time
>>> goes by. You can keep the partition bounded by using (device_id, bucket,
>>> timestamp). Use hour, day, month or even year like Jack mentioned, depending
>>> on the size of the data.
>>>
>>> 2. As to your specific query, for a given partition and a time range, C*
>>> doesn't need to load the whole partition then filter. It only retrieves the
>>> slice within the time range from disk because the data is clustered by
>>> timestamp.
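Kai's two points could be sketched in CQL like this (the table name, column names, and the day-granularity text bucket are my own assumptions, not from the thread):

```sql
-- Point 1: bound each partition with a time bucket in the partition key.
CREATE TABLE events_by_device (
    device_id uuid,
    bucket    text,       -- e.g. '2015-11-09' for a day bucket (illustrative)
    ts        timestamp,
    latitude  double,
    longitude double,
    PRIMARY KEY ((device_id, bucket), ts)
);

-- Point 2: within one partition, a time-range query reads a contiguous
-- slice on the clustering column, not the whole partition:
SELECT * FROM events_by_device
 WHERE device_id = 123e4567-e89b-12d3-a456-426655440000
   AND bucket = '2015-11-09'
   AND ts >= '2015-11-09 08:00:00'
   AND ts <  '2015-11-09 12:00:00';
```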
>>>
>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <jack.krupan...@gmail.com
>>> > wrote:
>>>
>>>> The general rule in Cassandra data modeling is to look at all of your
>>>> queries first and then to declare a table for each query, even if that
>>>> means storing multiple copies of the data. So, create a second table with
>>>> bucketed time as the partition key (hour, 15 minutes, or whatever time
>>>> interval makes sense to give 1 to 10 megabytes per partition) and time and
>>>> device as the clustering keys.
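A sketch of that second, query-specific table, assuming 15-minute buckets (the table name, column names, and bucket granularity are illustrative; pick the interval that lands partitions in the 1–10 MB range Jack mentions):

```sql
-- Hypothetical second table for "all devices in a time window" queries.
CREATE TABLE events_by_time (
    time_bucket text,       -- e.g. '2015-11-09 14:15' for a 15-minute bucket
    ts          timestamp,
    device_id   uuid,
    latitude    double,
    longitude   double,
    PRIMARY KEY ((time_bucket), ts, device_id)
);

-- One partition per interval; a week of data is read with one query
-- per bucket, no ALLOW FILTERING needed:
SELECT * FROM events_by_time WHERE time_bucket = '2015-11-09 14:15';
```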
>>>>
>>>> Or, consider DSE Search, and then you can do whatever ad hoc queries
>>>> you want using Solr. Or use Stratio or TupleJump Stargate for an open
>>>> source Lucene plugin.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>>>> guilla...@databerries.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are currently storing geolocation events (about 1 per 5 minutes)
>>>>> for each device we track. We currently have 2 TB of data. I would like to
>>>>> store the device_id, the timestamp of the event, latitude and longitude. I
>>>>> thought about using the device_id as the partition key and timestamp as the
>>>>> clustering column. It is great as events are naturally grouped by device
>>>>> (very useful for our Spark jobs). However, if I want to retrieve all
>>>>> events of all devices from the last week, I understand that Cassandra will
>>>>> need to load all the data and filter it, which does not seem to be clean
>>>>> in the long term.
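The model described above could be sketched like this (table and column names are my own, assuming the four columns mentioned):

```sql
-- Unbounded-partition version: one partition per device, events
-- clustered by timestamp. Retrieving "all devices, last week" from
-- this table is the query that triggers the ALLOW FILTERING warning.
CREATE TABLE events (
    device_id uuid,
    ts        timestamp,
    latitude  double,
    longitude double,
    PRIMARY KEY ((device_id), ts)
);
```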
>>>>>
>>>>> How should I create my model?
>>>>>
>>>>> Best Regards
>>>>>
>>>>
>>>>
>>>
>>
>