For the bucket key of the events_by_time table (a 5-minute period in my case), is it usually recommended to use a timestamp or a string?
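Either representation works as long as every writer and reader derives the bucket the same way; a timestamp-typed bucket avoids string-format ambiguity, while a string is easier to eyeball in cqlsh. A minimal Python sketch of computing both forms of a 5-minute bucket key (function names are illustrative, not from the thread):

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 5 * 60  # 5-minute buckets, matching the interval discussed

def bucket_as_timestamp(ts: datetime) -> datetime:
    """Truncate a timestamp down to its 5-minute bucket boundary."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BUCKET_SECONDS, tz=timezone.utc)

def bucket_as_string(ts: datetime) -> str:
    """Render the same bucket as a fixed-width, lexically sortable string key."""
    return bucket_as_timestamp(ts).strftime("%Y-%m-%d %H:%M")

event_time = datetime(2015, 11, 9, 17, 33, 41, tzinfo=timezone.utc)
print(bucket_as_timestamp(event_time))  # 2015-11-09 17:30:00+00:00
print(bucket_as_string(event_time))     # 2015-11-09 17:30
```

Whichever you pick, compute the bucket in one shared helper so the key written at insert time always matches the key used at query time.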
On Mon, Nov 9, 2015 at 5:05 PM, Kai Wang <dep...@gmail.com> wrote:

> It depends on the size of each event. You want to bound each partition
> under ~10MB. In system.log, look for entries like:
>
> WARN [CompactionExecutor:39] 2015-11-07 17:32:00,019
> SSTableWriter.java:240 - Compacting large partition
> xxxx:9f80ce31-b7e7-40c7-b642-f5d03fc320aa (13443863224 bytes)
>
> This is the warning sign that you have large partitions. The threshold is
> defined by compaction_large_partition_warning_threshold_mb in
> cassandra.yaml. The default is 100MB.
>
> You can also use nodetool cfstats to check partition sizes.
>
> On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon
> <guilla...@databerries.com> wrote:
>
>> For the first table, (device_id, timestamp): should I add a bucket even
>> if I know I might have millions of events per device, but never billions?
>>
>> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky
>> <jack.krupan...@gmail.com> wrote:
>>
>>> Cassandra is good at two kinds of queries: 1) access a specific row by
>>> a specific key, and 2) access a slice or consecutive sequence of rows
>>> within a given partition.
>>>
>>> It is recommended to avoid ALLOW FILTERING. If it happens to work well
>>> for you, great, go for it, but if it doesn't, then simply don't do it.
>>> Best to redesign your data model to play to Cassandra's strengths.
>>>
>>> If you bucket the time-based table, do a separate query for each time
>>> bucket.
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon
>>> <guilla...@databerries.com> wrote:
>>>
>>>> Kai, Jack,
>>>>
>>>> On 1., should the bucket be a STRING with a date format, or do I have
>>>> a better option? For (device_id, bucket, timestamp), did you mean
>>>> ((device_id, bucket), timestamp)?
>>>>
>>>> On 2., what are the risks of timeout? I currently have this warning:
>>>> "Cannot execute this query as it might involve data filtering and thus
>>>> may have unpredictable performance. If you want to execute this query
>>>> despite the performance unpredictability, use ALLOW FILTERING".
>>>>
>>>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>>>>
>>>>> 1. Don't make your partition unbounded. It's tempting to just use
>>>>> (device_id, timestamp), but sooner or later you will have problems as
>>>>> time goes by. You can keep the partition bounded by using (device_id,
>>>>> bucket, timestamp). Use hour, day, month or even year like Jack
>>>>> mentioned, depending on the size of the data.
>>>>>
>>>>> 2. As to your specific query, for a given partition and a time range,
>>>>> C* doesn't need to load the whole partition and then filter. It only
>>>>> retrieves the slice within the time range from disk, because the data
>>>>> is clustered by timestamp.
>>>>>
>>>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky
>>>>> <jack.krupan...@gmail.com> wrote:
>>>>>
>>>>>> The general rule in Cassandra data modeling is to look at all of
>>>>>> your queries first and then declare a table for each query, even if
>>>>>> that means storing multiple copies of the data. So, create a second
>>>>>> table with bucketed time as the partition key (hour, 15 minutes, or
>>>>>> whatever time interval makes sense to give 1 to 10 megabytes per
>>>>>> partition) and time and device as the clustering keys.
>>>>>>
>>>>>> Or, consider DSE Search, and then you can do whatever ad hoc queries
>>>>>> you want using Solr. Or Stratio or TupleJump Stargate for an open
>>>>>> source Lucene plugin.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon
>>>>>> <guilla...@databerries.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We are currently storing geolocation events (about 1 per 5 minutes)
>>>>>>> for each device we track. We currently have 2 TB of data. I would
>>>>>>> like to store the device_id, the timestamp of the event, the
>>>>>>> latitude and the longitude. I thought about using the device_id as
>>>>>>> the partition key and the timestamp as the clustering column. It is
>>>>>>> great as events are naturally grouped by device (very useful for
>>>>>>> our Spark jobs). However, if I would like to retrieve all events of
>>>>>>> all devices from the last week, I understand that Cassandra will
>>>>>>> need to load all the data and filter it, which does not seem clean
>>>>>>> in the long term.
>>>>>>>
>>>>>>> How should I create my model?
>>>>>>>
>>>>>>> Best Regards
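Jack's advice to "do a separate query for each time bucket" means the client enumerates every bucket key covering the requested range and issues one SELECT per key against the bucketed table. A minimal sketch of that enumeration, under the assumption of 5-minute buckets (the driver calls and table name are omitted; names are illustrative):

```python
from datetime import datetime, timedelta, timezone

BUCKET = timedelta(minutes=5)  # bucket width; adjust to whatever interval you settle on

def bucket_keys(start: datetime, end: datetime):
    """Yield every bucket key covering [start, end), one per query."""
    # Align the first key to a bucket boundary so it matches the keys
    # that were computed at insert time.
    epoch = int(start.timestamp())
    cur = datetime.fromtimestamp(epoch - epoch % int(BUCKET.total_seconds()),
                                 tz=timezone.utc)
    while cur < end:
        yield cur
        cur += BUCKET

start = datetime(2015, 11, 9, 0, 0, tzinfo=timezone.utc)
keys = list(bucket_keys(start, start + timedelta(hours=1)))
print(len(keys))  # 12 five-minute buckets in one hour
```

Each yielded key would then be bound as the partition key of a prepared statement (conceptually, SELECT ... WHERE bucket = ? with a clustering-range condition on the timestamp), and the per-bucket results concatenated client-side. Note that a week of 5-minute buckets is about 2,000 partitions per query, which is one reason Jack suggests sizing the bucket interval so each partition lands in the 1-10 MB range.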