It depends on the size of each event. You want to bound each partition to
under ~10 MB. In system.log, look for an entry like:
WARN [CompactionExecutor:39] 2015-11-07 17:32:00,019 SSTableWriter.java:240 -
Compacting large partition xxxx:9f80ce31-b7e7-40c7-b642-f5d03fc320aa
(13443863224 bytes)

This is the warning sign that you have large partitions. The threshold is
defined by compaction_large_partition_warning_threshold_mb in cassandra.yaml.
The default is 100MB. You can also use nodetool cfstats to check partition
sizes.

On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon <
guilla...@databerries.com> wrote:

> For the first table: (device_id, timestamp), should I add a bucket even
> if I know I might have millions of events per device but never billions?
>
> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> Cassandra is good at two kinds of queries: 1) accessing a specific row
>> by a specific key, and 2) accessing a slice, or consecutive sequence of
>> rows, within a given partition.
>>
>> It is recommended to avoid ALLOW FILTERING. If it happens to work well
>> for you, great, go for it, but if it doesn't, then simply don't do it.
>> Best to redesign your data model to play to Cassandra's strengths.
>>
>> If you bucket the time-based table, do a separate query for each time
>> bucket.
>>
>> -- Jack Krupansky
>>
>> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
>> guilla...@databerries.com> wrote:
>>
>>> Kai, Jack,
>>>
>>> On 1., should the bucket be a STRING with a date format, or do I have
>>> a better option? For (device_id, bucket, timestamp), did you mean
>>> ((device_id, bucket), timestamp)?
>>>
>>> On 2., what are the risks of a timeout? I currently have this warning:
>>> "Cannot execute this query as it might involve data filtering and thus
>>> may have unpredictable performance. If you want to execute this query
>>> despite the performance unpredictability, use ALLOW FILTERING".
>>>
>>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>>>
>>>> 1. Don't make your partition unbounded.
>>>> It's tempting to just use (device_id, timestamp), but sooner or later
>>>> you will have problems as time goes by. You can keep the partition
>>>> bounded by using (device_id, bucket, timestamp). Use an hour, day,
>>>> month or even year, as Jack mentioned, depending on the size of the
>>>> data.
>>>>
>>>> 2. As to your specific query: for a given partition and a time range,
>>>> C* doesn't need to load the whole partition and then filter. It only
>>>> retrieves the slice within the time range from disk, because the data
>>>> is clustered by timestamp.
>>>>
>>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <
>>>> jack.krupan...@gmail.com> wrote:
>>>>
>>>>> The general rule in Cassandra data modeling is to look at all of
>>>>> your queries first and then to declare a table for each query, even
>>>>> if that means storing multiple copies of the data. So, create a
>>>>> second table with bucketed time as the partition key (an hour, 15
>>>>> minutes, or whatever time interval makes sense to give 1 to 10
>>>>> megabytes per partition) and time and device as the clustering keys.
>>>>>
>>>>> Or, consider DSE Search, and then you can do whatever ad hoc queries
>>>>> you want using Solr. Or Stratio or TupleJump Stargate for an open
>>>>> source Lucene plugin.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>>>>> guilla...@databerries.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We are currently storing geolocation events (about 1 per 5 minutes)
>>>>>> for each device we track. We currently have 2 TB of data. I would
>>>>>> like to store the device_id, the timestamp of the event, the
>>>>>> latitude and the longitude. I thought about using the device_id as
>>>>>> the partition key and the timestamp as the clustering column. It is
>>>>>> great as events are naturally grouped by device (very useful for
>>>>>> our Spark jobs).
>>>>>> However, if I would like to retrieve all events of all devices from
>>>>>> the last week, I understand that Cassandra will need to load all
>>>>>> the data and filter it, which does not seem to be clean in the long
>>>>>> term.
>>>>>>
>>>>>> How should I create my model?
>>>>>>
>>>>>> Best Regards
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
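[Editor's note] Pulling the thread's advice together: a ((device_id, bucket), timestamp) key keeps partitions bounded, and a "last week" scan becomes one slice query per bucket rather than ALLOW FILTERING. A minimal sketch of the bucket bookkeeping in Python; the day-granularity 'YYYYMMDD' string format, the table name, and the column names are illustrative assumptions, not something the thread settled on:

```python
from datetime import datetime, timedelta

def day_bucket(ts: datetime) -> str:
    # Day-granularity bucket encoded as a 'YYYYMMDD' string (an illustrative
    # choice; hour, month, or year buckets work the same way, per Kai's note).
    return ts.strftime("%Y%m%d")

def buckets_between(start: datetime, end: datetime) -> list:
    """All day buckets touched by [start, end] -- issue one CQL query per bucket."""
    out = []
    d = start.date()
    while d <= end.date():
        out.append(d.strftime("%Y%m%d"))
        d += timedelta(days=1)
    return out

# Hypothetical schema this bookkeeping assumes (sketch, not from the thread):
#   CREATE TABLE events (
#       device_id text,
#       bucket    text,
#       ts        timestamp,
#       latitude  double,
#       longitude double,
#       PRIMARY KEY ((device_id, bucket), ts)
#   );
#
# "Last week for one device" then becomes one bounded slice per bucket:
#   SELECT * FROM events
#   WHERE device_id = ? AND bucket = ? AND ts >= ? AND ts < ?;

end = datetime(2015, 11, 9)
start = end - timedelta(days=7)
print(buckets_between(start, end))  # eight buckets, '20151102' .. '20151109'
```

Because bucket is part of the partition key, each of those per-bucket queries is a single-partition slice, which is exactly the access pattern Jack and Kai describe Cassandra being good at.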