It depends on the size of each event. You want to bound each partition
under ~10MB. In system.log, look for entries like:

WARN  [CompactionExecutor:39] 2015-11-07 17:32:00,019
SSTableWriter.java:240 - Compacting large partition
xxxx:9f80ce31-b7e7-40c7-b642-f5d03fc320aa (13443863224 bytes)

This is the warning sign that you have large partitions. The threshold is
defined by compaction_large_partition_warning_threshold_mb in
cassandra.yaml. The default is 100MB.
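
For reference, the relevant line in cassandra.yaml (shown here at its
default) is:

    compaction_large_partition_warning_threshold_mb: 100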

You can also use nodetool cfstats to check partition size.
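
For example (keyspace and table names are placeholders):

    nodetool cfstats my_keyspace.my_table

If I remember correctly, the "Compacted partition maximum bytes" and
"Compacted partition mean bytes" lines give you a quick read on how big
your partitions are getting.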

On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon <
guilla...@databerries.com> wrote:

> For the first table: (device_id, timestamp), should I add a bucket even
> if I know I might have millions of events per device but never billions?
>
> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> Cassandra is good at two kinds of queries: 1) access a specific row by a
>> specific key, and 2) Access a slice or consecutive sequence of rows within
>> a given partition.
>>
>> It is recommended to avoid ALLOW FILTERING. If it happens to work well
>> for you, great, go for it, but if it doesn't then simply don't do it. Best
>> to redesign your data model to play to Cassandra's strengths.
>>
>> If you bucket the time-based table, do a separate query for each time
>> bucket.
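>>
>> For example, with day buckets (using the events_by_time sketch from my
>> earlier mail below; the table name and layout are just an
>> illustration), last week becomes seven queries like:
>>
>>   SELECT device_id, ts, latitude, longitude
>>   FROM events_by_time
>>   WHERE time_bucket = '2015-11-02';
>>
>> issued once per day bucket ('2015-11-02' through '2015-11-08'),
>> ideally asynchronously, with the results merged client-side.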
>>
>> -- Jack Krupansky
>>
>> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
>> guilla...@databerries.com> wrote:
>>
>>> Kai, Jack,
>>>
>>> On 1., should the bucket be a STRING with a date format, or do I have a
>>> better option? For (device_id, bucket, timestamp), did you mean
>>> ((device_id, bucket), timestamp)?
>>>
>>> On 2., what are the risks of timeouts? I currently have this warning:
>>> "Cannot execute this query as it might involve data filtering and thus may
>>> have unpredictable performance. If you want to execute this query despite
>>> the performance unpredictability, use ALLOW FILTERING".
>>>
>>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>>>
>>>> 1. Don't make your partition unbounded. It's tempting to just use
>>>> (device_id, timestamp), but sooner or later you will have problems as
>>>> time goes by. You can keep the partition bounded by using (device_id,
>>>> bucket, timestamp). Use an hour, day, month or even year bucket, as
>>>> Jack mentioned, depending on the size of your data.
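>>>>
>>>> Something like this, purely as a sketch (table and column names,
>>>> types, and the day-sized text bucket are just examples, adjust to
>>>> your volume):
>>>>
>>>>   CREATE TABLE events_by_device (
>>>>       device_id uuid,
>>>>       bucket    text,       -- e.g. '2015-11-09' for a day bucket
>>>>       ts        timestamp,
>>>>       latitude  double,
>>>>       longitude double,
>>>>       PRIMARY KEY ((device_id, bucket), ts)
>>>>   ) WITH CLUSTERING ORDER BY (ts DESC);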
>>>>
>>>> 2. As to your specific query, for a given partition and a time range,
>>>> C* doesn't need to load the whole partition then filter. It only retrieves
>>>> the slice within the time range from disk because the data is clustered by
>>>> timestamp.
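>>>>
>>>> For example, assuming the events_by_device sketch above, this reads
>>>> only the matching slice of one partition (values are placeholders):
>>>>
>>>>   SELECT ts, latitude, longitude
>>>>   FROM events_by_device
>>>>   WHERE device_id = 9f80ce31-b7e7-40c7-b642-f5d03fc320aa
>>>>     AND bucket = '2015-11-09'
>>>>     AND ts >= '2015-11-09 00:00:00'
>>>>     AND ts < '2015-11-09 12:00:00';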
>>>>
>>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <
>>>> jack.krupan...@gmail.com> wrote:
>>>>
>>>>> The general rule in Cassandra data modeling is to look at all of your
>>>>> queries first and then to declare a table for each query, even if that
>>>>> means storing multiple copies of the data. So, create a second table with
>>>>> bucketed time as the partition key (hour, 15 minutes, or whatever time
>>>>> interval makes sense to give 1 to 10 megabytes per partition) and time and
>>>>> device as the clustering keys.
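>>>>>
>>>>> Something along these lines, as a sketch (names and types are
>>>>> illustrative; pick the bucket granularity so partitions stay in that
>>>>> 1 to 10 MB range):
>>>>>
>>>>>   CREATE TABLE events_by_time (
>>>>>       time_bucket text,     -- e.g. '2015-11-09-17' for an hour bucket
>>>>>       ts          timestamp,
>>>>>       device_id   uuid,
>>>>>       latitude    double,
>>>>>       longitude   double,
>>>>>       PRIMARY KEY ((time_bucket), ts, device_id)
>>>>>   );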
>>>>>
>>>>> Or, consider DSE Search, and then you can do whatever ad hoc queries
>>>>> you want using Solr. Or use Stratio or TupleJump Stargate for an open
>>>>> source Lucene plugin.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>>>>> guilla...@databerries.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We are currently storing geolocation events (about 1 per 5 minutes)
>>>>>> for each device we track. We currently have 2 TB of data. I would
>>>>>> like to store the device_id, the timestamp of the event, latitude,
>>>>>> and longitude. I thought about using the device_id as the partition
>>>>>> key and timestamp as the clustering column. It is great as events
>>>>>> are naturally grouped by device (very useful for our Spark jobs).
>>>>>> However, if I want to retrieve all events of all devices for the
>>>>>> last week, I understand that Cassandra will need to load all the
>>>>>> data and filter it, which does not seem sustainable in the long
>>>>>> term.
>>>>>>
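>>>>>> For reference, here is the model I had in mind (types are my best
>>>>>> guess):
>>>>>>
>>>>>>   CREATE TABLE events (
>>>>>>       device_id uuid,
>>>>>>       ts        timestamp,
>>>>>>       latitude  double,
>>>>>>       longitude double,
>>>>>>       PRIMARY KEY (device_id, ts)
>>>>>>   );
>>>>>>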
>>>>>> How should I create my model?
>>>>>>
>>>>>> Best Regards
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
