Some options I can think of:

1 - depending on your data size and how often you query by stime, you may
use Spark to perform queries filtering by server time on the log table,
perhaps within a device-time window to reduce the dataset your Spark job
needs to scan. more info on the Spark connector:
https://github.com/datastax/spark-cassandra-connector
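A rough sketch of what that job could look like with the connector's RDD API
(the keyspace name "gps", the imei value, and the time window below are
placeholders, not from the original schema):

```scala
// Sketch, assuming keyspace "gps" and placeholder imei/date/window values.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("stime-query")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

val from = java.sql.Timestamp.valueOf("2015-06-01 00:00:00")
val to   = java.sql.Timestamp.valueOf("2015-06-02 00:00:00")

// Restrict partitions by imei/date (pushed down to Cassandra), then
// filter on stime in Spark, since stime is not part of the primary key.
val rows = sc.cassandraTable("gps", "log")
  .where("imei = ? AND date = ?", "123456789012345", "2015-06-01")
  .filter { row =>
    val st = row.getDate("stime")
    !st.before(from) && st.before(to)
  }
```

Only the imei/date restriction is served by Cassandra; the stime predicate
runs in Spark, which is why bounding the device-time window first matters.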

2 - if dtime and stime are almost always in the same date bucket
(day/hour/minute/second), you may create an additional table stime_log
with the same structure, but where the date bucket is derived from the
stime field. so, whenever an entry's stime and dtime fall in different
buckets, you insert that entry into both the log and stime_log tables. when
you want to query entries by stime, you take the distinct union of the
queries against both tables in your client application. this way, you only
duplicate delayed data.
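For reference, the stime_log table for this option could look like the
following sketch (same columns as log; only the meaning of the date bucket
changes — and note you may prefer clustering on stime instead of dtime if
you want range scans on stime within a partition, which is my assumption
here, not something stated above):

```sql
-- Sketch of stime_log for option 2: same columns as log,
-- but "date" is derived from stime rather than dtime.
CREATE TABLE stime_log (
  imei ascii,
  date ascii,      -- date taken from stime
  dtime timestamp,
  data ascii,
  stime timestamp,
  PRIMARY KEY ((imei, date), stime))
  WITH CLUSTERING ORDER BY (stime DESC)
;
```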

3 - if your "data" field is big and you can't afford duplicating it, create
an additional table stime_log, but do not store the data field, only the
metadata (imei, date, dtime, stime). so when you want to query by stime,
first query stime_log, and then query the original log table to fetch the
data field.
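That two-step lookup could be sketched as follows (imei/date/timestamp
values are placeholders; note that in the original log table the date
bucket comes from dtime, so for the second query you must derive the date
from each dtime returned by the first query):

```sql
-- Sketch for option 3: stime-bucketed index table without the data field.
CREATE TABLE stime_log (
  imei ascii,
  date ascii,      -- date taken from stime
  dtime timestamp,
  stime timestamp,
  PRIMARY KEY ((imei, date), stime))
  WITH CLUSTERING ORDER BY (stime DESC)
;

-- Step 1: find dtime values for the server-time window.
SELECT imei, dtime FROM stime_log
  WHERE imei = '123456789012345' AND date = '2015-06-01'
  AND stime >= '2015-06-01 00:00:00' AND stime < '2015-06-02 00:00:00';

-- Step 2: fetch the payload from log, with date re-derived from dtime.
SELECT data FROM log
  WHERE imei = '123456789012345' AND date = '2015-06-01'
  AND dtime = '2015-06-01 10:00:00';
```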

2015-06-05 18:10 GMT-03:00 Abhishek Singh Bailoo <
abhishek.singh.bai...@gmail.com>:

> Hello!
>
> I have a column family to log in data coming from my GPS devices.
>
> CREATE TABLE log(
>   imei ascii,
>   date ascii,
>   dtime timestamp,
>   data ascii,
>   stime timestamp,
>   PRIMARY KEY ((imei, date), dtime))
>   WITH CLUSTERING ORDER BY (dtime DESC)
> ;
>
> It is the standard schema for modeling time series data where
> imei is the unique ID associated with each GPS device
> date is the date taken from dtime
> dtime is the date-time coming from the device
> data is all the latitude, longitude etc that the device is sending us
> stime is the date-time stamp of the server
>
> The reason why I put dtime in the primary key as the clustering column is
> because most of our queries are done on device time. There can be a delay
> of a few minutes to a few hours (or a few days! in rare cases) between
> dtime and stime if the network is not available.
>
> However, now we want to query on server time as well, for the purpose of
> debugging. These queries will not be as common as queries on device time.
> Say for every 100 queries on dtime there will be just 1 query on stime.
>
> What options do I have?
>
> 1. Secondary Index - not possible because stime is a timestamp and CQL
> does not allow me to put < or > in the query for a secondary index
>
> 2. Data duplication - I can build another column family where I will index
> by stime but that means I am storing twice as much data. I know everyone
> says that write operations are cheap and storage is cheap but how? If I
> have to buy twice as many machines on AWS EC2 each with their own ephemeral
> storage, then my bill doubles up!
>
> Any other ideas I can try?
>
> Many Thanks,
> Abhishek
>
