For the events_by_time table, is it generally recommended to make the bucket
key (a 5-minute period in my case) a timestamp or a string?
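Either encoding works as long as it is computed deterministically at write time; here is a minimal Python sketch of the two options (the function names and the 5-minute width are illustrative, not from the thread):

```python
from datetime import datetime, timezone

def bucket_as_timestamp(ts: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to the start of its bucket (timestamp form)."""
    floored_minute = ts.minute - (ts.minute % minutes)
    return ts.replace(minute=floored_minute, second=0, microsecond=0)

def bucket_as_string(ts: datetime, minutes: int = 5) -> str:
    """Same bucket, encoded as a sortable string key."""
    return bucket_as_timestamp(ts, minutes).strftime("%Y-%m-%dT%H:%M")

event_time = datetime(2015, 11, 9, 17, 7, 42, tzinfo=timezone.utc)
print(bucket_as_timestamp(event_time))  # 2015-11-09 17:05:00+00:00
print(bucket_as_string(event_time))     # 2015-11-09T17:05
```

The timestamp form stays natively comparable in CQL; the string form is human-readable and sorts correctly if zero-padded as above.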

On Mon, Nov 9, 2015 at 5:05 PM, Kai Wang <dep...@gmail.com> wrote:

> It depends on the size of each event. You want to bound each partition
> under ~10MB. In system.log, look for entries like:
>
> WARN  [CompactionExecutor:39] 2015-11-07 17:32:00,019
> SSTableWriter.java:240 - Compacting large partition
> xxxx:9f80ce31-b7e7-40c7-b642-f5d03fc320aa (13443863224 bytes)
>
> This is a warning sign that you have large partitions. The threshold is
> defined by compaction_large_partition_warning_threshold_mb in
> cassandra.yaml. The default is 100MB.
>
> You can also use nodetool cfstats to check partition size.
>
> On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon <
> guilla...@databerries.com> wrote:
>
>> For the first table: (device_id, timestamp), should I add a bucket even
>> if I know I might have millions of events per device but never billions?
>>
>> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky <jack.krupan...@gmail.com>
>> wrote:
>>
>>> Cassandra is good at two kinds of queries: 1) accessing a specific row by
>>> a specific key, and 2) accessing a slice, i.e. a consecutive sequence of
>>> rows, within a given partition.
>>>
>>> It is recommended to avoid ALLOW FILTERING. If it happens to work well
>>> for you, great, go for it, but if it doesn't then simply don't do it. Best
>>> to redesign your data model to play to Cassandra's strengths.
>>>
>>> If you bucket the time-based table, do a separate query for each time
>>> bucket.
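That fan-out can be sketched as follows, assuming day-sized buckets and an events_by_time table with a text bucket column (the table and column names are assumptions for illustration):

```python
from datetime import date, timedelta

def day_buckets(start: date, end: date) -> list:
    """Enumerate the day buckets covering [start, end] inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]

# One query per bucket, e.g. for the week of Nov 2-8, 2015:
buckets = day_buckets(date(2015, 11, 2), date(2015, 11, 8))
for b in buckets:
    # Hypothetical statement; execute once per bucket with your driver.
    query = f"SELECT * FROM events_by_time WHERE bucket = '{b}'"
```

Each query then hits exactly one partition, and the per-bucket queries can be issued in parallel.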
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
>>> guilla...@databerries.com> wrote:
>>>
>>>> Kai, Jack,
>>>>
>>>> On 1., should the bucket be a STRING with a date format, or is there a
>>>> better option? For (device_id, bucket, timestamp), did you mean
>>>> ((device_id, bucket), timestamp)?
>>>>
>>>> On 2., what are the risks of a timeout? I currently have this warning:
>>>> "Cannot execute this query as it might involve data filtering and thus may
>>>> have unpredictable performance. If you want to execute this query despite
>>>> the performance unpredictability, use ALLOW FILTERING".
>>>>
>>>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>>>>
>>>>> 1. Don't make your partition unbounded. It's tempting to just use
>>>>> (device_id, timestamp), but sooner or later you will have problems as
>>>>> time goes by. You can keep the partition bounded by using (device_id,
>>>>> bucket, timestamp). Use hour, day, month or even year like Jack
>>>>> mentioned, depending on the size of the data.
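A minimal sketch of that bounded model, with illustrative column names and a month-sized bucket chosen purely for the example:

```python
# Sketch of the bounded-partition model described above; table and
# column names are illustrative, not from the thread.
CREATE_EVENTS_BY_DEVICE = """
CREATE TABLE events_by_device (
    device_id text,
    bucket    text,        -- e.g. '2015-11' for a month bucket
    ts        timestamp,
    latitude  double,
    longitude double,
    PRIMARY KEY ((device_id, bucket), ts)   -- composite partition key
) WITH CLUSTERING ORDER BY (ts DESC);
"""

from datetime import datetime

def month_bucket(ts: datetime) -> str:
    """Derive the bucket at write time so each partition stays bounded."""
    return ts.strftime("%Y-%m")

print(month_bucket(datetime(2015, 11, 9)))  # 2015-11
```

The double parentheses in PRIMARY KEY make (device_id, bucket) the composite partition key, which is the ((device_id, bucket), timestamp) form asked about above.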
>>>>>
>>>>> 2. As to your specific query: for a given partition and a time range,
>>>>> C* doesn't need to load the whole partition and then filter. It only
>>>>> retrieves the slice within the time range from disk, because the data
>>>>> is clustered by timestamp.
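This behavior can be pictured with a rough in-memory analogy (not Cassandra's actual storage code): rows sorted by the clustering column support a contiguous slice lookup rather than a full scan plus filter.

```python
import bisect

# Rows within a partition are stored sorted by the clustering column
# (timestamp here), so a time-range read is a contiguous slice.
rows = [(1, "a"), (3, "b"), (5, "c"), (7, "d"), (9, "e")]  # sorted by ts

def slice_range(rows, t_min, t_max):
    """Binary-search the slice [t_min, t_max] instead of scanning all rows."""
    lo = bisect.bisect_left(rows, (t_min,))
    hi = bisect.bisect_right(rows, (t_max, chr(0x10FFFF)))
    return rows[lo:hi]

print(slice_range(rows, 3, 7))  # [(3, 'b'), (5, 'c'), (7, 'd')]
```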
>>>>>
>>>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <
>>>>> jack.krupan...@gmail.com> wrote:
>>>>>
>>>>>> The general rule in Cassandra data modeling is to look at all of your
>>>>>> queries first and then to declare a table for each query, even if that
>>>>>> means storing multiple copies of the data. So, create a second table with
>>>>>> bucketed time as the partition key (hour, 15 minutes, or whatever time
>>>>>> interval makes sense to give 1 to 10 megabytes per partition) and time 
>>>>>> and
>>>>>> device as the clustering keys.
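A sketch of such a second table, plus back-of-the-envelope partition sizing; every name and number below is an assumption for illustration, not a measurement:

```python
# Hypothetical second table: bucketed time as the partition key,
# time and device as the clustering keys.
CREATE_EVENTS_BY_TIME = """
CREATE TABLE events_by_time (
    bucket    text,        -- e.g. '2015-11-09T17:00'
    ts        timestamp,
    device_id text,
    latitude  double,
    longitude double,
    PRIMARY KEY (bucket, ts, device_id)
);
"""

# Rough sizing to pick the bucket interval (assumed numbers):
row_bytes = 60                      # device_id + timestamp + lat + lon, roughly
devices = 100_000
events_per_device_per_bucket = 3    # 1 event / 5 min over a 15-minute bucket
partition_mb = devices * events_per_device_per_bucket * row_bytes / 1e6
print(round(partition_mb, 1))  # 18.0
```

Under these assumptions a 15-minute bucket yields ~18 MB per partition, above the 1-10 MB target, which would argue for a shorter interval or a coarser row encoding.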
>>>>>>
>>>>>> Or, consider DSE Search, which lets you do whatever ad hoc queries
>>>>>> you want using Solr. Or use Stratio or TupleJump Stargate for an open
>>>>>> source Lucene plugin.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>>>>>> guilla...@databerries.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We are currently storing geolocation events (about 1 per 5 minutes)
>>>>>>> for each device we track, and we currently have 2 TB of data. I would
>>>>>>> like to store the device_id, the timestamp of the event, the latitude
>>>>>>> and the longitude. I thought about using device_id as the partition
>>>>>>> key and timestamp as the clustering column. That is great, as events
>>>>>>> are naturally grouped by device (very useful for our Spark jobs).
>>>>>>> However, if I want to retrieve all events of all devices from the
>>>>>>> last week, I understand that Cassandra will need to load all the data
>>>>>>> and filter it, which does not seem clean in the long term.
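The model described above could look like this (illustrative CQL with assumed names), which makes the unbounded-partition concern visible:

```python
# Naive model from the question: one ever-growing partition per device.
CREATE_EVENTS_BY_DEVICE = """
CREATE TABLE events_by_device (
    device_id text,
    ts        timestamp,
    latitude  double,
    longitude double,
    PRIMARY KEY (device_id, ts)   -- partition grows without bound over time
);
"""
print("(device_id)" if "((" not in CREATE_EVENTS_BY_DEVICE else "composite")
```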
>>>>>>>
>>>>>>> How should I create my model?
>>>>>>>
>>>>>>> Best Regards
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
