For the first table, (device_id, timestamp): should I add a bucket even if I know I might have millions of events per device but never billions?
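To make the question concrete, the bucketed variant I have in mind would look roughly like this (a day-granularity text bucket; the table and column names are only illustrative):

    CREATE TABLE events_by_device (
        device_id text,
        bucket    text,       -- e.g. '2015-11-09' for a day bucket
        timestamp timestamp,
        latitude  double,
        longitude double,
        PRIMARY KEY ((device_id, bucket), timestamp)
    );

versus the unbucketed PRIMARY KEY (device_id, timestamp).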
On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> Cassandra is good at two kinds of queries: 1) access a specific row by a
> specific key, and 2) access a slice or consecutive sequence of rows within
> a given partition.
>
> It is recommended to avoid ALLOW FILTERING. If it happens to work well for
> you, great, go for it, but if it doesn't, then simply don't do it. It is
> best to redesign your data model to play to Cassandra's strengths.
>
> If you bucket the time-based table, do a separate query for each time
> bucket.
>
> -- Jack Krupansky
>
> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
> guilla...@databerries.com> wrote:
>
>> Kai, Jack,
>>
>> On 1., should the bucket be a STRING with a date format, or do I have a
>> better option? For (device_id, bucket, timestamp), did you mean
>> ((device_id, bucket), timestamp)?
>>
>> On 2., what are the risks of a timeout? I currently have this warning:
>> "Cannot execute this query as it might involve data filtering and thus may
>> have unpredictable performance. If you want to execute this query despite
>> the performance unpredictability, use ALLOW FILTERING".
>>
>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>>
>>> 1. Don't make your partitions unbounded. It's tempting to just use
>>> (device_id, timestamp), but sooner or later you will have problems as
>>> time goes by. You can keep the partition bounded by using (device_id,
>>> bucket, timestamp). Use hour, day, month or even year like Jack
>>> mentioned, depending on the size of the data.
>>>
>>> 2. As to your specific query, for a given partition and a time range, C*
>>> doesn't need to load the whole partition and then filter. It only
>>> retrieves the slice within the time range from disk, because the data is
>>> clustered by timestamp.
>>>
>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <jack.krupan...@gmail.com>
>>> wrote:
>>>
>>>> The general rule in Cassandra data modeling is to look at all of your
>>>> queries first and then to declare a table for each query, even if that
>>>> means storing multiple copies of the data. So, create a second table
>>>> with bucketed time as the partition key (hour, 15 minutes, or whatever
>>>> time interval makes sense to give 1 to 10 megabytes per partition) and
>>>> time and device as the clustering keys.
>>>>
>>>> Or, consider DSE Search, and then you can do whatever ad hoc queries
>>>> you want using Solr. Or Stratio or Tuplejump Stargate for an open
>>>> source Lucene plugin.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>>>> guilla...@databerries.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are currently storing geolocation events (about 1 per 5 minutes)
>>>>> for each device we track. We currently have 2 TB of data. I would like
>>>>> to store the device_id, the timestamp of the event, the latitude and
>>>>> the longitude. I thought about using the device_id as the partition
>>>>> key and the timestamp as the clustering column. This is great, as
>>>>> events are naturally grouped by device (very useful for our Spark
>>>>> jobs). However, if I want to retrieve all events of all devices from
>>>>> the last week, I understand that Cassandra will need to load all the
>>>>> data and filter it, which does not seem to be a clean approach in the
>>>>> long term.
>>>>>
>>>>> How should I create my model?
>>>>>
>>>>> Best Regards
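P.S. For my own notes, here is how I understand the two suggestions in CQL (all names illustrative, using the sketch table from above). Kai's point 2, a slice within one device's partition, needs no filtering because the partition key is fully specified and the range is on the clustering column:

    SELECT timestamp, latitude, longitude
    FROM events_by_device
    WHERE device_id = 'dev-123'
      AND bucket = '2015-11-09'
      AND timestamp >= '2015-11-09 00:00:00'
      AND timestamp < '2015-11-10 00:00:00';

And Jack's second table for the "all devices, last week" query would put the time bucket in the partition key, with one query issued per bucket:

    CREATE TABLE events_by_time (
        bucket    text,       -- e.g. '2015-11-09-14' for an hour bucket
        timestamp timestamp,
        device_id text,
        latitude  double,
        longitude double,
        PRIMARY KEY (bucket, timestamp, device_id)
    );

    -- run once per bucket covering the week, then merge client-side:
    SELECT * FROM events_by_time WHERE bucket = '2015-11-09-14';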