Thank you for all these details. I will do a monthly bucket.

On Mon, Nov 9, 2015 at 7:58 PM, Kai Wang <dep...@gmail.com> wrote:
> The bucket key is just like any other column of the table; you can use any
> type as long as it's convenient for you to write the query.
>
> But I don't think you should use 5 minutes as your bucket key, since you
> only have 1 event every 5 minutes. A 5-minute bucket seems too small. The
> bucket key we mentioned is for breaking the (device_id, timestamp)
> partitions into ones sized between ~1MB and ~10MB.
>
> On Mon, Nov 9, 2015 at 11:50 AM, Guillaume Charhon <guilla...@databerries.com> wrote:
>
>> Is it usually recommended to use the bucket key (a 5-minute period in my
>> case) for the events_by_time table as a timestamp or as a string?
>>
>> On Mon, Nov 9, 2015 at 5:05 PM, Kai Wang <dep...@gmail.com> wrote:
>>
>>> It depends on the size of each event. You want to bound each partition
>>> under ~10MB. In system.log, look for entries like:
>>>
>>> WARN [CompactionExecutor:39] 2015-11-07 17:32:00,019 SSTableWriter.java:240 - Compacting large partition xxxx:9f80ce31-b7e7-40c7-b642-f5d03fc320aa (13443863224 bytes)
>>>
>>> This is the warning sign that you have large partitions. The threshold
>>> is defined by compaction_large_partition_warning_threshold_mb in
>>> cassandra.yaml. The default is 100MB.
>>>
>>> You can also use nodetool cfstats to check partition sizes.
>>>
>>> On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon <guilla...@databerries.com> wrote:
>>>
>>>> For the first table, (device_id, timestamp): should I add a bucket
>>>> even if I know I might have millions of events per device but never
>>>> billions?
>>>>
>>>> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>
>>>>> Cassandra is good at two kinds of queries: 1) access a specific row
>>>>> by a specific key, and 2) access a slice or consecutive sequence of
>>>>> rows within a given partition.
>>>>>
>>>>> It is recommended to avoid ALLOW FILTERING. If it happens to work
>>>>> well for you, great, go for it, but if it doesn't, then simply don't
>>>>> do it. It's best to redesign your data model to play to Cassandra's
>>>>> strengths.
>>>>>
>>>>> If you bucket the time-based table, do a separate query for each
>>>>> time bucket.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <guilla...@databerries.com> wrote:
>>>>>
>>>>>> Kai, Jack,
>>>>>>
>>>>>> On 1., should the bucket be a STRING with a date format, or do I
>>>>>> have a better option? For (device_id, bucket, timestamp), did you
>>>>>> mean ((device_id, bucket), timestamp)?
>>>>>>
>>>>>> On 2., what are the risks of timeout? I currently have this
>>>>>> warning: "Cannot execute this query as it might involve data
>>>>>> filtering and thus may have unpredictable performance. If you want
>>>>>> to execute this query despite the performance unpredictability, use
>>>>>> ALLOW FILTERING".
>>>>>>
>>>>>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>>>>>>
>>>>>>> 1. Don't make your partition unbounded. It's tempting to just use
>>>>>>> (device_id, timestamp), but sooner or later you will have problems
>>>>>>> as time goes by. You can keep the partition bounded by using
>>>>>>> (device_id, bucket, timestamp). Use hour, day, month or even year
>>>>>>> like Jack mentioned, depending on the size of the data.
>>>>>>>
>>>>>>> 2. As to your specific query: for a given partition and a time
>>>>>>> range, C* doesn't need to load the whole partition and then
>>>>>>> filter. It only retrieves the slice within the time range from
>>>>>>> disk, because the data is clustered by timestamp.
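The pattern above (one bounded partition per bucket, one SELECT per bucket, and a clustering-key slice on timestamp within each) can be sketched as follows. The "YYYY-MM" string bucket format and the table/column names are illustrative assumptions; the thread leaves the exact format open.

```python
from datetime import date

def month_buckets(start: date, end: date) -> list[str]:
    """All 'YYYY-MM' bucket keys touching the [start, end] date range."""
    buckets = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        buckets.append(f"{y:04d}-{m:02d}")
        m += 1
        if m == 13:
            y, m = y + 1, 1
    return buckets

# One SELECT per bucket; each hits a single bounded partition, and the
# timestamp range is served as a contiguous slice, not a full-partition scan.
for bucket in month_buckets(date(2015, 11, 2), date(2015, 11, 9)):
    print(f"SELECT * FROM events WHERE device_id = ? AND bucket = '{bucket}' "
          f"AND timestamp >= ? AND timestamp <= ?")
```

A week-long range usually touches one or two monthly buckets, so the fan-out stays small; the queries per bucket can also be issued concurrently.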
>>>>>>>
>>>>>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The general rule in Cassandra data modeling is to look at all of
>>>>>>>> your queries first and then declare a table for each query, even
>>>>>>>> if that means storing multiple copies of the data. So, create a
>>>>>>>> second table with bucketed time as the partition key (hour, 15
>>>>>>>> minutes, or whatever time interval makes sense to give 1 to 10
>>>>>>>> megabytes per partition) and time and device as the clustering
>>>>>>>> keys.
>>>>>>>>
>>>>>>>> Or, consider DSE Search and then you can do whatever ad hoc
>>>>>>>> queries you want using Solr. Or Stratio or TupleJump Stargate for
>>>>>>>> an open source Lucene plugin.
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <guilla...@databerries.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We are currently storing geolocation events (about 1 per 5
>>>>>>>>> minutes) for each device we track. We currently have 2 TB of
>>>>>>>>> data. I would like to store the device_id, the timestamp of the
>>>>>>>>> event, the latitude and the longitude. I thought about using the
>>>>>>>>> device_id as the partition key and the timestamp as the
>>>>>>>>> clustering column. This is great, as events are naturally
>>>>>>>>> grouped by device (very useful for our Spark jobs). However, if
>>>>>>>>> I want to retrieve all events of all devices from the last week,
>>>>>>>>> I understand that Cassandra will need to load all the data and
>>>>>>>>> filter it, which does not seem clean in the long term.
>>>>>>>>>
>>>>>>>>> How should I create my model?
>>>>>>>>>
>>>>>>>>> Best Regards
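The monthly-bucket choice can be sanity-checked against the ~1-10 MB-per-partition guideline from the thread, given the stated rate of one event per device every 5 minutes. The per-row byte count below is a rough assumption (timestamp, latitude, longitude, plus row overhead); measure real partitions with nodetool cfstats before trusting it. This estimates the per-device (device_id, bucket) partitions, not the events_by_time table, whose partitions hold all devices per bucket.

```python
# Rough partition-size estimate for a ((device_id, bucket), timestamp) table.
# BYTES_PER_ROW is an assumption, not a measured figure.
BYTES_PER_ROW = 60
EVENTS_PER_DAY = (24 * 60) // 5  # one event every 5 minutes -> 288 rows/day

def partition_bytes(days_per_bucket: int, bytes_per_row: int = BYTES_PER_ROW) -> int:
    """Estimated size of one device's partition for a bucket of the given length."""
    return EVENTS_PER_DAY * days_per_bucket * bytes_per_row

for label, days in [("day", 1), ("month", 30), ("year", 365)]:
    mb = partition_bytes(days) / 1e6
    print(f"{label:5s} bucket: ~{mb:.2f} MB per device")
```

Under these assumptions a monthly bucket lands around half a megabyte per device and even a yearly bucket stays inside the 1-10 MB window, so monthly leaves comfortable headroom if the event rate or row size grows.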