Hi Junaid,

I wrote a blog post a few months ago on massively scalable time series, covering a couple of bucketing techniques that you might find helpful.
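At a high level (this is only a rough sketch of the general idea, not necessarily the exact schema from the post; the table and column names below are made up), bucketing means folding a coarse time component into the partition key so that no single partition grows without bound:

// Illustrative only: a hypothetical bucketed variant of a per-sensor table.
// The partition key combines the device with a month bucket (a text label
// such as '2017-11'), so one partition holds at most one month of data
// for a single device.
CREATE TABLE ks.fuel_signal_by_month (
    deviceid text,
    month text,
    datetime timestamp,
    value text,
    PRIMARY KEY ((deviceid, month), datetime)
) WITH CLUSTERING ORDER BY (datetime ASC);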
http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html

Jon

> On Nov 19, 2017, at 2:09 PM, Junaid Nasir <jna...@an10.io> wrote:
>
> We are building an IoT platform where time series data from millions of devices is to be collected and then used for analytics pertaining to Business Intelligence/Analytics (BI/BA).
>
> Within the above context, we are running into the issue of having range-based queries, where the granularity of the range can change. Furthermore, the query aggregation unit can be on the basis of device ID, and possibly a collection of such devices.
>
> This leads us to model the tables in the following way.
>
> C* table schemas:
>
> // Per sensor table
> CREATE TABLE ks.fuel_signal (
>     deviceid text,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((deviceid), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC);
>
> // Per device table
> CREATE TABLE ks.device1 (
>     sensor text,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((sensor), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC);
>
> While the above solves the need to query a single sensor's values over a range of device IDs, it poses a problem when doing the same query over a time range. The problem, simply put, is that the time cost of a range query (for the same range) in the above model increases with the total size of the table (we are using Spark DataFrames for this purpose).
>
> The first question is whether there is some best practice to handle such a flow, which seems quite natural to us and should be common.
>
> Then, based on reading the mailing list and previous discussions here, it seems to us that we could use date buckets as the partition key, so that the data for a date range is collocated:
>
> CREATE TABLE ks.fuel_signal (
>     deviceid text,
>     day date,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((day), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC);
>
> But this would require a very large IN (...) list if the buckets are daily and the query covers the last 3 months.
>
> So, is this the right way, or is there a better way?
>
> Thanks,
> Junaid
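For illustration, assuming a per-device, per-month bucketed table like the sketch above (table name, column names, and values here are hypothetical), a query covering the last 3 months only has to enumerate 3 buckets rather than ~90 daily ones:

// Partition key fully restricted (deviceid with =, month with IN),
// clustering column restricted with a time range.
SELECT datetime, value
FROM ks.fuel_signal_by_month
WHERE deviceid = 'dev-001'
  AND month IN ('2017-09', '2017-10', '2017-11')
  AND datetime >= '2017-09-01 00:00:00+0000'
  AND datetime <  '2017-12-01 00:00:00+0000';

If the buckets have to stay daily, a common alternative to one huge IN list is to issue one small query per bucket from the client (or from Spark) and merge the results, rather than asking a single coordinator to fan out across every partition in the range.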