Thank you for all these details. I will do a monthly bucket.

On Mon, Nov 9, 2015 at 7:58 PM, Kai Wang <dep...@gmail.com> wrote:
> The bucket key is just like any other column of the table; you can use any
> type as long as it's convenient for you to write the query.
>
> But I don't think you should use 5 minutes as your bucket key, since you
> only have 1 event every 5 minutes. A 5-minute bucket seems too small. The
> bucket key we mentioned is for breaking the (device_id, timestamp)
> partitions into ones sized between ~1MB and ~10MB.
>
> On Mon, Nov 9, 2015 at 11:50 AM, Guillaume Charhon <guilla...@databerries.com> wrote:
>
>> Is it usually recommended to use the bucket key (a 5-minute period in my
>> case) for the events_by_time table as a timestamp or as a string?
>>
>> On Mon, Nov 9, 2015 at 5:05 PM, Kai Wang <dep...@gmail.com> wrote:
>>
>>> It depends on the size of each event. You want to bound each partition
>>> under ~10MB. In system.log, look for entries like:
>>>
>>> WARN [CompactionExecutor:39] 2015-11-07 17:32:00,019 SSTableWriter.java:240 - Compacting large partition xxxx:9f80ce31-b7e7-40c7-b642-f5d03fc320aa (13443863224 bytes)
>>>
>>> This is the warning sign that you have large partitions. The threshold
>>> is defined by compaction_large_partition_warning_threshold_mb in
>>> cassandra.yaml. The default is 100MB.
>>>
>>> You can also use nodetool cfstats to check partition sizes.
>>>
>>> On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon <guilla...@databerries.com> wrote:
>>>
>>>> For the first table, (device_id, timestamp): should I add a bucket
>>>> even if I know I might have millions of events per device but never
>>>> billions?
>>>>
>>>> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>
>>>>> Cassandra is good at two kinds of queries: 1) access a specific row
>>>>> by a specific key, and 2) access a slice or consecutive sequence of
>>>>> rows within a given partition.
>>>>>
>>>>> It is recommended to avoid ALLOW FILTERING. If it happens to work
>>>>> well for you, great, go for it, but if it doesn't, then simply don't
>>>>> do it. It's best to redesign your data model to play to Cassandra's
>>>>> strengths.
>>>>>
>>>>> If you bucket the time-based table, do a separate query for each
>>>>> time bucket.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <guilla...@databerries.com> wrote:
>>>>>
>>>>>> Kai, Jack,
>>>>>>
>>>>>> On 1., should the bucket be a STRING with a date format, or do I
>>>>>> have a better option? For (device_id, bucket, timestamp), did you
>>>>>> mean ((device_id, bucket), timestamp)?
>>>>>>
>>>>>> On 2., what are the risks of timeout? I currently have this
>>>>>> warning: "Cannot execute this query as it might involve data
>>>>>> filtering and thus may have unpredictable performance. If you want
>>>>>> to execute this query despite the performance unpredictability, use
>>>>>> ALLOW FILTERING".
>>>>>>
>>>>>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>>>>>>
>>>>>>> 1. Don't make your partition unbounded. It's tempting to just use
>>>>>>> (device_id, timestamp), but sooner or later you will have problems
>>>>>>> as time goes by. You can keep the partition bounded by using
>>>>>>> (device_id, bucket, timestamp). Use hour, day, month or even year
>>>>>>> like Jack mentioned, depending on the size of the data.
>>>>>>>
>>>>>>> 2. As to your specific query: for a given partition and a time
>>>>>>> range, C* doesn't need to load the whole partition and then
>>>>>>> filter. It only retrieves the slice within the time range from
>>>>>>> disk, because the data is clustered by timestamp.
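The pattern above (one bounded partition per bucket, one SELECT per bucket, and a clustering-key slice on timestamp within each) can be sketched as follows. The "YYYY-MM" string bucket format and the table/column names are illustrative assumptions; the thread leaves the exact format open.

```python
from datetime import date

def month_buckets(start: date, end: date) -> list[str]:
    """All 'YYYY-MM' bucket keys touching the [start, end] date range."""
    buckets = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        buckets.append(f"{y:04d}-{m:02d}")
        m += 1
        if m == 13:
            y, m = y + 1, 1
    return buckets

# One SELECT per bucket; each hits a single bounded partition, and the
# timestamp range is served as a contiguous slice, not a full-partition scan.
for bucket in month_buckets(date(2015, 11, 2), date(2015, 11, 9)):
    print(f"SELECT * FROM events WHERE device_id = ? AND bucket = '{bucket}' "
          f"AND timestamp >= ? AND timestamp <= ?")
```

A week-long range usually touches one or two monthly buckets, so the fan-out stays small; the queries per bucket can also be issued concurrently.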
>>>>>>>
>>>>>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The general rule in Cassandra data modeling is to look at all of
>>>>>>>> your queries first and then declare a table for each query, even
>>>>>>>> if that means storing multiple copies of the data. So, create a
>>>>>>>> second table with bucketed time as the partition key (hour, 15
>>>>>>>> minutes, or whatever time interval makes sense to give 1 to 10
>>>>>>>> megabytes per partition) and time and device as the clustering
>>>>>>>> keys.
>>>>>>>>
>>>>>>>> Or, consider DSE Search and then you can do whatever ad hoc
>>>>>>>> queries you want using Solr. Or Stratio or TupleJump Stargate for
>>>>>>>> an open source Lucene plugin.
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <guilla...@databerries.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We are currently storing geolocation events (about 1 per 5
>>>>>>>>> minutes) for each device we track. We currently have 2 TB of
>>>>>>>>> data. I would like to store the device_id, the timestamp of the
>>>>>>>>> event, the latitude and the longitude. I thought about using the
>>>>>>>>> device_id as the partition key and the timestamp as the
>>>>>>>>> clustering column. This is great, as events are naturally
>>>>>>>>> grouped by device (very useful for our Spark jobs). However, if
>>>>>>>>> I want to retrieve all events of all devices from the last week,
>>>>>>>>> I understand that Cassandra will need to load all the data and
>>>>>>>>> filter it, which does not seem clean in the long term.
>>>>>>>>>
>>>>>>>>> How should I create my model?
>>>>>>>>>
>>>>>>>>> Best Regards
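The monthly-bucket choice can be sanity-checked against the ~1-10 MB-per-partition guideline from the thread, given the stated rate of one event per device every 5 minutes. The per-row byte count below is a rough assumption (timestamp, latitude, longitude, plus row overhead); measure real partitions with nodetool cfstats before trusting it. This estimates the per-device (device_id, bucket) partitions, not the events_by_time table, whose partitions hold all devices per bucket.

```python
# Rough partition-size estimate for a ((device_id, bucket), timestamp) table.
# BYTES_PER_ROW is an assumption, not a measured figure.
BYTES_PER_ROW = 60
EVENTS_PER_DAY = (24 * 60) // 5  # one event every 5 minutes -> 288 rows/day

def partition_bytes(days_per_bucket: int, bytes_per_row: int = BYTES_PER_ROW) -> int:
    """Estimated size of one device's partition for a bucket of the given length."""
    return EVENTS_PER_DAY * days_per_bucket * bytes_per_row

for label, days in [("day", 1), ("month", 30), ("year", 365)]:
    mb = partition_bytes(days) / 1e6
    print(f"{label:5s} bucket: ~{mb:.2f} MB per device")
```

Under these assumptions a monthly bucket lands around half a megabyte per device and even a yearly bucket stays inside the 1-10 MB window, so monthly leaves comfortable headroom if the event rate or row size grows.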