We are building an IoT platform in which time series data from millions of devices
is collected and then used for analytics in the context of Business
Intelligence/Analytics (BI/BA).
Within this context, we are running into the issue of handling range-based
queries, where the granularity of the range can change. Furthermore, the query
aggregation unit can be the device ID, and possibly a collection of such devices.
This leads us to model the tables in the following way.
C* table schemas:

// Per sensor table
CREATE TABLE ks.fuel_signal (
    deviceid text,
    datetime timestamp,
    value text,
    PRIMARY KEY ((deviceid), datetime)
) WITH CLUSTERING ORDER BY (datetime ASC);

// Per device table
CREATE TABLE ks.device1 (
    sensor text,
    datetime timestamp,
    value text,
    PRIMARY KEY ((sensor), datetime)
) WITH CLUSTERING ORDER BY (datetime ASC);
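For context, the access pattern we have in mind against the per-sensor table is
roughly the sketch below (device IDs and dates are made up for illustration):

    SELECT deviceid, datetime, value
    FROM ks.fuel_signal
    WHERE deviceid IN ('dev-001', 'dev-002', 'dev-003')
      AND datetime >= '2017-01-01'
      AND datetime <  '2017-04-01';

This works well because deviceid is the partition key and datetime is the
clustering column, so each partition is read as a contiguous slice.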
While the above does solve the need to query a single sensor value over a range
of device IDs, it poses a problem when running the same query over a time range.
The problem, simply put, is that the time cost of such a range query (for the
same range) in the above model increases with the total size of the table (we
are using Spark DataFrames for this purpose).
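The time-range-only query is, very roughly, the sketch below; since the
partition key is not restricted, it cannot be served as a partition slice and
ends up scanning the whole table (ALLOW FILTERING is shown only to make the
point in plain CQL; in practice the scan is driven through Spark):

    SELECT deviceid, datetime, value
    FROM ks.fuel_signal
    WHERE datetime >= '2017-01-01'
      AND datetime <  '2017-04-01'
    ALLOW FILTERING;  -- no partition key restriction: full table scan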
The first question is whether there is some best practice to handle such a
flow, which seems to us quite natural and should be common.
Then it seems to us, based on reading the mailing list and previous discussions
here, that we can bucket the data by date, using the date bucket as the
partition key so that rows within a date range are colocated.
Date-time bucket schema:

CREATE TABLE ks.fuel_signal (
    deviceid text,
    day date,
    datetime timestamp,
    value text,
    PRIMARY KEY ((day), datetime)
) WITH CLUSTERING ORDER BY (datetime ASC);
However, this would require a very large IN (...) list of partition keys if the
bucket is daily and the query covers, say, the last 3 months.
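To make that concrete, a 3-month query with daily buckets would look something
like the sketch below, with roughly 90 partition keys in the IN list
(abbreviated here, dates made up):

    SELECT day, datetime, value
    FROM ks.fuel_signal
    WHERE day IN ('2017-01-01', '2017-01-02', /* ...~90 days... */ '2017-03-31')
      AND datetime >= '2017-01-01'
      AND datetime <  '2017-04-01';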
So, is this the right way or is there a better way?
Thanks,
Junaid