The bucket key is just like any other column of the table: you can use any type, as long as it's convenient for you to write the query.

But I don't think you should use 5 minutes as your bucket key, since you only have one event every 5 minutes; a 5-minute bucket seems too small. The bucket key we mentioned is for you to break the (device_id, timestamp) partitions into ones with a size between ~1MB and ~10MB.
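For illustration, here is a minimal CQL sketch of such a bucketed table. The table name, column names and types, and the day-granularity text bucket are assumptions for the example, not something specified in this thread:

    -- One partition per (device, bucket). Choose the bucket granularity
    -- (hour, day, month, ...) so each partition lands in the ~1MB-10MB range.
    CREATE TABLE events_by_device (
        device_id uuid,
        bucket    text,        -- e.g. '2015-11-09' for a day bucket
        ts        timestamp,
        latitude  double,
        longitude double,
        PRIMARY KEY ((device_id, bucket), ts)
    );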
On Mon, Nov 9, 2015 at 11:50 AM, Guillaume Charhon <guilla...@databerries.com> wrote:

> Is it usually recommended to use a timestamp or a string for the bucket
> key (usually a 5-minute period in my case) of the events_by_time table?
>
> On Mon, Nov 9, 2015 at 5:05 PM, Kai Wang <dep...@gmail.com> wrote:
>
>> It depends on the size of each event. You want to bound each partition
>> under ~10MB. In system.log, look for an entry like:
>>
>> WARN [CompactionExecutor:39] 2015-11-07 17:32:00,019
>> SSTableWriter.java:240 - Compacting large partition
>> xxxx:9f80ce31-b7e7-40c7-b642-f5d03fc320aa (13443863224 bytes)
>>
>> This is the warning sign that you have large partitions. The threshold
>> is defined by compaction_large_partition_warning_threshold_mb in
>> cassandra.yaml. The default is 100MB.
>>
>> You can also use nodetool cfstats to check partition sizes.
>>
>> On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon
>> <guilla...@databerries.com> wrote:
>>
>>> For the first table, (device_id, timestamp): should I add a bucket
>>> even if I know I might have millions of events per device but never
>>> billions?
>>>
>>> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky
>>> <jack.krupan...@gmail.com> wrote:
>>>
>>>> Cassandra is good at two kinds of queries: 1) access a specific row
>>>> by a specific key, and 2) access a slice or consecutive sequence of
>>>> rows within a given partition.
>>>>
>>>> It is recommended to avoid ALLOW FILTERING. If it happens to work
>>>> well for you, great, go for it, but if it doesn't, then simply don't
>>>> do it. It's best to redesign your data model to play to Cassandra's
>>>> strengths.
>>>>
>>>> If you bucket the time-based table, do a separate query for each
>>>> time bucket.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon
>>>> <guilla...@databerries.com> wrote:
>>>>
>>>>> Kai, Jack,
>>>>>
>>>>> On 1., should the bucket be a STRING with a date format, or do I
>>>>> have a better option? For (device_id, bucket, timestamp), did you
>>>>> mean ((device_id, bucket), timestamp)?
>>>>>
>>>>> On 2., what are the risks of timeout? I currently have this warning:
>>>>> "Cannot execute this query as it might involve data filtering and
>>>>> thus may have unpredictable performance. If you want to execute this
>>>>> query despite the performance unpredictability, use ALLOW FILTERING".
>>>>>
>>>>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>>>>>
>>>>>> 1. Don't make your partition unbounded. It's tempting to just use
>>>>>> (device_id, timestamp), but sooner or later you will have problems
>>>>>> as time goes by. You can keep the partition bounded by using
>>>>>> (device_id, bucket, timestamp). Use hour, day, month or even year
>>>>>> like Jack mentioned, depending on the size of the data.
>>>>>>
>>>>>> 2. As to your specific query: for a given partition and a time
>>>>>> range, C* doesn't need to load the whole partition and then filter.
>>>>>> It only retrieves the slice within the time range from disk,
>>>>>> because the data is clustered by timestamp.
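For illustration, point 2 corresponds to a bounded slice query like the sketch below, written against the hypothetical events_by_device table above; with a bucketed key you run one such query per bucket in the range, as Jack suggests:

    -- Fully restricts the partition key and ranges over the clustering
    -- column, so Cassandra reads only the matching slice from disk;
    -- no ALLOW FILTERING is needed.
    SELECT ts, latitude, longitude
    FROM events_by_device
    WHERE device_id = ?                     -- bound in a prepared statement
      AND bucket = '2015-11-09'
      AND ts >= '2015-11-09 00:00:00'
      AND ts <  '2015-11-09 12:00:00';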
>>>>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky
>>>>>> <jack.krupan...@gmail.com> wrote:
>>>>>>
>>>>>>> The general rule in Cassandra data modeling is to look at all of
>>>>>>> your queries first and then to declare a table for each query,
>>>>>>> even if that means storing multiple copies of the data. So, create
>>>>>>> a second table with bucketed time as the partition key (hour, 15
>>>>>>> minutes, or whatever time interval makes sense to give 1 to 10
>>>>>>> megabytes per partition) and time and device as the clustering
>>>>>>> keys.
>>>>>>>
>>>>>>> Or, consider DSE Search, and then you can do whatever ad hoc
>>>>>>> queries you want using Solr. Or Stratio or TupleJump Stargate for
>>>>>>> an open source Lucene plugin.
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon
>>>>>>> <guilla...@databerries.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We are currently storing geolocation events (about one per 5
>>>>>>>> minutes) for each device we track. We currently have 2 TB of
>>>>>>>> data. I would like to store the device_id, the timestamp of the
>>>>>>>> event, latitude and longitude. I thought about using the
>>>>>>>> device_id as the partition key and the timestamp as the
>>>>>>>> clustering column. It is great, as events are naturally grouped
>>>>>>>> by device (very useful for our Spark jobs). However, if I would
>>>>>>>> like to retrieve all events of all devices of the last week, I
>>>>>>>> understood that Cassandra will need to load all the data and
>>>>>>>> filter, which does not seem to be clean in the long term.
>>>>>>>>
>>>>>>>> How should I create my model?
>>>>>>>>
>>>>>>>> Best Regards
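For illustration, here is a minimal CQL sketch of the query-specific second table Jack describes, with bucketed time as the partition key and time and device as the clustering keys. The table name, column names and types, and the hour granularity are assumptions, not from the thread:

    CREATE TABLE events_by_time (
        bucket    text,        -- e.g. '2015-11-09 14' for an hour bucket
        ts        timestamp,
        device_id uuid,
        latitude  double,
        longitude double,
        PRIMARY KEY (bucket, ts, device_id)
    );

With this layout, "all events of all devices of the last week" becomes one bounded query per hourly bucket instead of a full scan with ALLOW FILTERING.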