Hi Junaid,

I wrote a blog post a few months ago on massively scalable time series, covering a couple of bucketing techniques that you might find helpful.
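At a high level (this is only a rough sketch of the general idea, not necessarily the exact schema from the post; the table and column names below are made up), bucketing means folding a coarse time component into the partition key so that no single partition grows without bound:

// Illustrative only: a hypothetical bucketed variant of a per-sensor table.
// The partition key combines the device with a month bucket (a text label
// such as '2017-11'), so one partition holds at most one month of data
// for a single device.
CREATE TABLE ks.fuel_signal_by_month (
    deviceid text,
    month text,
    datetime timestamp,
    value text,
    PRIMARY KEY ((deviceid, month), datetime)
) WITH CLUSTERING ORDER BY (datetime ASC);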
http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html

Jon

> On Nov 19, 2017, at 2:09 PM, Junaid Nasir <jna...@an10.io> wrote:
>
> We are building an IoT platform where time series data from millions of devices is to be collected and then used for analytics pertaining to Business Intelligence/Analytics (BI/BA).
>
> Within the above context, we are running into the issue of having range-based queries, where the granularity of the range can change. Furthermore, the query aggregation unit can be on the basis of device ID, and possibly a collection of such devices.
>
> This leads us to model the tables in the following way.
>
> C* table schemas:
>
> // Per sensor table
> CREATE TABLE ks.fuel_signal (
>     deviceid text,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((deviceid), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC);
>
> // Per device table
> CREATE TABLE ks.device1 (
>     sensor text,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((sensor), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC);
>
> While the above solves the need to query a single sensor's values over a range of device IDs, it poses a problem when doing the same query over a time range. The problem, simply put, is that the time cost of a range query (for the same range) in the above model increases with the total size of the table (we are using Spark DataFrames for this purpose).
>
> The first question is whether there is some best practice to handle such a flow, which seems quite natural to us and should be common.
>
> Then, based on reading the mailing list and previous discussions here, it seems to us that we could use date buckets as the partition key, so that the data for a date range is collocated:
>
> CREATE TABLE ks.fuel_signal (
>     deviceid text,
>     day date,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((day), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC);
>
> But this would require a very large IN (...) list if the buckets are daily and the query covers the last 3 months.
>
> So, is this the right way, or is there a better way?
>
> Thanks,
> Junaid
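For illustration, assuming a per-device, per-month bucketed table like the sketch above (table name, column names, and values here are hypothetical), a query covering the last 3 months only has to enumerate 3 buckets rather than ~90 daily ones:

// Partition key fully restricted (deviceid with =, month with IN),
// clustering column restricted with a time range.
SELECT datetime, value
FROM ks.fuel_signal_by_month
WHERE deviceid = 'dev-001'
  AND month IN ('2017-09', '2017-10', '2017-11')
  AND datetime >= '2017-09-01 00:00:00+0000'
  AND datetime <  '2017-12-01 00:00:00+0000';

If the buckets have to stay daily, a common alternative to one huge IN list is to issue one small query per bucket from the client (or from Spark) and merge the results, rather than asking a single coordinator to fan out across every partition in the range.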