The bucket key is just like any other column of the table: you can use any type, as long as it's convenient for you to write the query.

But I don't think you should use 5 minutes as your bucket key, since you only have one event every 5 minutes; a 5-minute bucket seems too small. The bucket key we mentioned is for you to break the (device_id, timestamp) partitions into ones with a size between ~1MB and ~10MB.
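For illustration, here is a minimal CQL sketch of such a bucketed table. The table name, column names and types, and the day-granularity text bucket are assumptions for the example, not something specified in this thread:

    -- One partition per (device, bucket). Choose the bucket granularity
    -- (hour, day, month, ...) so each partition lands in the ~1MB-10MB range.
    CREATE TABLE events_by_device (
        device_id uuid,
        bucket    text,        -- e.g. '2015-11-09' for a day bucket
        ts        timestamp,
        latitude  double,
        longitude double,
        PRIMARY KEY ((device_id, bucket), ts)
    );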
On Mon, Nov 9, 2015 at 11:50 AM, Guillaume Charhon <guilla...@databerries.com> wrote:

> Is it usually recommended to use a timestamp or a string for the bucket
> key (usually a 5-minute period in my case) of the events_by_time table?
>
> On Mon, Nov 9, 2015 at 5:05 PM, Kai Wang <dep...@gmail.com> wrote:
>
>> It depends on the size of each event. You want to bound each partition
>> under ~10MB. In system.log, look for an entry like:
>>
>> WARN [CompactionExecutor:39] 2015-11-07 17:32:00,019
>> SSTableWriter.java:240 - Compacting large partition
>> xxxx:9f80ce31-b7e7-40c7-b642-f5d03fc320aa (13443863224 bytes)
>>
>> This is the warning sign that you have large partitions. The threshold
>> is defined by compaction_large_partition_warning_threshold_mb in
>> cassandra.yaml. The default is 100MB.
>>
>> You can also use nodetool cfstats to check partition sizes.
>>
>> On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon
>> <guilla...@databerries.com> wrote:
>>
>>> For the first table, (device_id, timestamp): should I add a bucket
>>> even if I know I might have millions of events per device but never
>>> billions?
>>>
>>> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky
>>> <jack.krupan...@gmail.com> wrote:
>>>
>>>> Cassandra is good at two kinds of queries: 1) access a specific row
>>>> by a specific key, and 2) access a slice or consecutive sequence of
>>>> rows within a given partition.
>>>>
>>>> It is recommended to avoid ALLOW FILTERING. If it happens to work
>>>> well for you, great, go for it, but if it doesn't, then simply don't
>>>> do it. It's best to redesign your data model to play to Cassandra's
>>>> strengths.
>>>>
>>>> If you bucket the time-based table, do a separate query for each
>>>> time bucket.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon
>>>> <guilla...@databerries.com> wrote:
>>>>
>>>>> Kai, Jack,
>>>>>
>>>>> On 1., should the bucket be a STRING with a date format, or do I
>>>>> have a better option? For (device_id, bucket, timestamp), did you
>>>>> mean ((device_id, bucket), timestamp)?
>>>>>
>>>>> On 2., what are the risks of timeout? I currently have this warning:
>>>>> "Cannot execute this query as it might involve data filtering and
>>>>> thus may have unpredictable performance. If you want to execute this
>>>>> query despite the performance unpredictability, use ALLOW FILTERING".
>>>>>
>>>>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>>>>>
>>>>>> 1. Don't make your partition unbounded. It's tempting to just use
>>>>>> (device_id, timestamp), but sooner or later you will have problems
>>>>>> as time goes by. You can keep the partition bounded by using
>>>>>> (device_id, bucket, timestamp). Use hour, day, month or even year
>>>>>> like Jack mentioned, depending on the size of the data.
>>>>>>
>>>>>> 2. As to your specific query: for a given partition and a time
>>>>>> range, C* doesn't need to load the whole partition and then filter.
>>>>>> It only retrieves the slice within the time range from disk,
>>>>>> because the data is clustered by timestamp.
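For illustration, point 2 corresponds to a bounded slice query like the sketch below, written against the hypothetical events_by_device table above; with a bucketed key you run one such query per bucket in the range, as Jack suggests:

    -- Fully restricts the partition key and ranges over the clustering
    -- column, so Cassandra reads only the matching slice from disk;
    -- no ALLOW FILTERING is needed.
    SELECT ts, latitude, longitude
    FROM events_by_device
    WHERE device_id = ?                     -- bound in a prepared statement
      AND bucket = '2015-11-09'
      AND ts >= '2015-11-09 00:00:00'
      AND ts <  '2015-11-09 12:00:00';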
>>>>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky
>>>>>> <jack.krupan...@gmail.com> wrote:
>>>>>>
>>>>>>> The general rule in Cassandra data modeling is to look at all of
>>>>>>> your queries first and then to declare a table for each query,
>>>>>>> even if that means storing multiple copies of the data. So, create
>>>>>>> a second table with bucketed time as the partition key (hour, 15
>>>>>>> minutes, or whatever time interval makes sense to give 1 to 10
>>>>>>> megabytes per partition) and time and device as the clustering
>>>>>>> keys.
>>>>>>>
>>>>>>> Or, consider DSE Search, and then you can do whatever ad hoc
>>>>>>> queries you want using Solr. Or Stratio or TupleJump Stargate for
>>>>>>> an open source Lucene plugin.
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon
>>>>>>> <guilla...@databerries.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We are currently storing geolocation events (about one per 5
>>>>>>>> minutes) for each device we track. We currently have 2 TB of
>>>>>>>> data. I would like to store the device_id, the timestamp of the
>>>>>>>> event, latitude and longitude. I thought about using the
>>>>>>>> device_id as the partition key and the timestamp as the
>>>>>>>> clustering column. It is great, as events are naturally grouped
>>>>>>>> by device (very useful for our Spark jobs). However, if I would
>>>>>>>> like to retrieve all events of all devices of the last week, I
>>>>>>>> understood that Cassandra will need to load all the data and
>>>>>>>> filter, which does not seem to be clean in the long term.
>>>>>>>>
>>>>>>>> How should I create my model?
>>>>>>>>
>>>>>>>> Best Regards
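For illustration, here is a minimal CQL sketch of the query-specific second table Jack describes, with bucketed time as the partition key and time and device as the clustering keys. The table name, column names and types, and the hour granularity are assumptions, not from the thread:

    CREATE TABLE events_by_time (
        bucket    text,        -- e.g. '2015-11-09 14' for an hour bucket
        ts        timestamp,
        device_id uuid,
        latitude  double,
        longitude double,
        PRIMARY KEY (bucket, ts, device_id)
    );

With this layout, "all events of all devices of the last week" becomes one bounded query per hourly bucket instead of a full scan with ALLOW FILTERING.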