We use PlayOrm to do something similar.

We have an object like this (typing all this from memory)…

public class TimeSeries {

   // partitions the data by month so no single partition grows unbounded
   @NoSqlPartitionedByField
   private long beginOfMonth;

   // indexed so the query language can apply range conditions within a partition
   @NoSqlIndexed
   private long timestamp;

}

Then we just use the Scalable-SQL support to query into the partition itself.  This 
is all on the RandomPartitioner as well.  We could partition by day if we had a much 
larger data load, but we tend not to need that.  The query looks something like 
this: "PARTITIONS s(:beginOfMonth) select s from TimeSeries as s" or 
"PARTITIONS s(:beginOfMonth) select s from TimeSeries as s where s.timestamp > 
:start and s.timestamp < :end".
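
For concreteness, binding and running that second query looks roughly like this 
(again typed from memory, so treat the exact method names as approximate; the 
named query "findInMonth" would be declared with a @NoSqlQuery annotation on the 
entity, and mgr is the NoSqlEntityManager):

Query<TimeSeries> query = mgr.createNamedQuery(TimeSeries.class, "findInMonth");
query.setParameter("beginOfMonth", beginOfMonth);  // selects the partition
query.setParameter("start", start);                // range on the indexed timestamp
query.setParameter("end", end);
List<TimeSeries> rows = query.getResultList();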

Later,
Dean

From: Chin Ko <cko2...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Tuesday, December 11, 2012 7:23 AM
To: user@cassandra.apache.org
Subject: Selecting rows efficiently from a Cassandra CF containing time series 
data

I would like to get some opinions on how to select an incremental range of rows 
efficiently from a Cassandra CF containing time series data.

Background:
We have a web application that uses a Cassandra CF as logging storage. We 
insert a row into the CF for every "event" of each user of the web application. 
The row key is timestamp+userid. The column values are unstructured data. We 
only insert rows but never update or delete any rows in the CF.
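
Schematically, a row key is composed like this (the delimiter and types here are 
illustrative only; the point is just that the timestamp leads and the user id 
follows):

long timestamp = System.currentTimeMillis();
String userId = "user123";                  // illustrative value
String rowKey = timestamp + "_" + userId;   // "timestamp+userid" as described above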

Data volume:
The CF grows by about 0.5 million rows per day. We have a 4 node cluster and 
use the RandomPartitioner to spread the rows across the nodes.

Requirements:
There is a need to transfer the Cassandra data to another relational database 
periodically. Due to the large size of the CF, instead of truncating the 
relational table and reloading all rows into it each time, we plan to run a job 
to select the "delta" rows since the last run and insert them into the 
relational database.

We would like some flexibility in how often the data transfer job runs. It may 
run several times a day, or it may not run at all on a given day.

Options considered:
- We are using RandomPartitioner, so range scan by row key is not feasible.
- Add a secondary index on the timestamp column, but reading rows via a 
secondary index still requires at least one equality condition and does not 
support a range scan.
- Add a secondary index on a column containing the date and hour of the 
timestamp, then iterate over each hour between the time the job was last run 
and now, fetching all rows for each hour (see the sketch after this list).
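
For the last option, the hour-by-hour extraction loop would look something like 
this (fetchRowsForHour, insertIntoRelationalDb, and the Row type are placeholders 
for whatever Cassandra client and JDBC code we end up using):

private static final long HOUR_MS = 60L * 60L * 1000L;

void transferDelta(long lastRunMillis) {
    long hour = lastRunMillis - (lastRunMillis % HOUR_MS);   // align to start of hour
    long now = System.currentTimeMillis();
    for (; hour <= now; hour += HOUR_MS) {
        // equality lookup on the indexed date-hour column, e.g. "2012121107"
        String bucket = new java.text.SimpleDateFormat("yyyyMMddHH")
                .format(new java.util.Date(hour));
        for (Row row : fetchRowsForHour(bucket)) {   // placeholder secondary-index query
            insertIntoRelationalDb(row);             // placeholder JDBC insert
        }
    }
}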

I would appreciate ideas on other design options for the Cassandra CF that would 
enable extracting the rows efficiently.

Besides Java, has anyone used any ETL tools to do this kind of delta extraction 
from Cassandra?

Thanks,
Chin
