We will use Cassandra as logging storage in one of our web applications. The
application only inserts rows into Cassandra and never updates or deletes any
rows. The column family (CF) is expected to grow by about 0.5 million rows per day.
We need to transfer the data in Cassandra to a relational database daily.
Due to the large size of the CF, instead of truncating the relational table and
reloading all rows into it each time, we plan to run a job to select the
"delta" rows since the last run and insert them into the relational database.
We know we can use Java, Pig or Hive to extract the delta rows to a flat file
and load the data into the target relational table. We are particularly
interested in a process that can extract delta rows without scanning the entire
CF.
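
To make the goal concrete, here is a minimal Python sketch of the kind of delta
selection we have in mind. It is not tied to any Cassandra client. It assumes
(this is only an assumption, nothing we have built yet) that each row is written
under a per-day bucket key together with its write timestamp, and it uses an
in-memory dict as a stand-in for the CF. The names day_buckets, extract_delta
and store are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def day_buckets(last_run, now):
    """Return the 'YYYY-MM-DD' bucket keys that may hold rows written in (last_run, now]."""
    buckets = []
    day = last_run.date()
    while day <= now.date():
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets


def extract_delta(store, last_run, now):
    """Read only the buckets touched since last_run and keep rows newer than it.

    `store` is a hypothetical stand-in for the CF: a dict that maps a day
    bucket to a list of (written_at, payload) tuples.
    """
    delta = []
    for bucket in day_buckets(last_run, now):
        for written_at, payload in store.get(bucket, []):
            # The first bucket can still contain rows from before the last run.
            if last_run < written_at <= now:
                delta.append((written_at, payload))
    # Order by write time so the load into the relational table is deterministic.
    return sorted(delta, key=lambda row: row[0])
```

With a layout like this the job would touch only the buckets written since the
last run, and the timestamp of the last run would be saved after each
successful load.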
Has anyone used any other ETL tools to do this kind of delta extraction from
Cassandra? We would appreciate any comments or experience you can share.
Thanks,
Chin