Consider adding a log_bucket timestamp column and then indexing it. Your data loader can then SELECT * FROM logs WHERE log_bucket = ?. The value you supply there would be the timestamp bucket you're processing: in your case, logged_at rounded down to a 5-second boundary, e.g. logged_at - (logged_at % 5000) for millisecond epoch timestamps.
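A minimal sketch of what the loader side could look like with the DataStax Java Driver 2.x, assuming you've added and indexed the column (ALTER TABLE logs ADD log_bucket timestamp; CREATE INDEX ON logs (log_bucket)). The contact point and keyspace name are placeholders, and the 5000 ms bucket math assumes logged_at is a millisecond epoch timestamp:

    import java.util.Date;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class BucketLoader {

        // Floor a millisecond epoch timestamp to its 5-second bucket.
        static long bucketFor(long epochMillis) {
            return epochMillis - (epochMillis % 5000L);
        }

        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("logs_ks"); // placeholder keyspace

            // Process the bucket that closed on the previous tick, not the
            // one writers may still be filling.
            long bucket = bucketFor(System.currentTimeMillis() - 5000L);
            ResultSet rs = session.execute(
                    "SELECT * FROM logs WHERE log_bucket = ?", new Date(bucket));
            for (Row row : rs) {
                // ship row.getString("txn_id"), row.getString("params"), etc.
                // to elasticsearch here
            }
            cluster.close();
        }
    }

Your writers would populate log_bucket with the same bucketFor(logged_at) value at insert time, so each row lands in exactly one bucket and is picked up by exactly one iteration.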
However, I'll caution against writing data to Cassandra and then trying to read it back reliably immediately afterwards. Due to eventual consistency you're likely to miss values this way unless you read at CL_ALL, but then your data loader will break whenever you have any node offline. Writing and then immediately reading data back is a typical antipattern in any eventually consistent system.

If you're using the DataStax Java Driver, you can combine CL_ALL with DowngradingConsistencyRetryPolicy (a sketch of the wiring is at the bottom of this message). That would at least strike a reasonable balance between fairly strong consistency and the resiliency you give up with CL_ALL, though when a node is offline your load process may get significantly slower. This mitigates the antipattern but doesn't eliminate it.

On Tue Nov 25 2014 at 2:11:36 AM Vinod Joseph <vinodjosep...@gmail.com> wrote:

> Hi,
>
> I am working on a java plugin which moves data from cassandra to
> elasticsearch. This plugin must run on the server every 5 seconds. The
> data is getting moved, but the issue is that every time the plugin runs
> (i.e. after every 5 seconds) all the data, including data which was
> already moved into elasticsearch in the previous iteration, is moved
> again. So we are getting duplicate values in elasticsearch. How can we
> avoid this problem?
>
> We are using this plugin to manage logs which are generated during
> online transactions, so we will be having millions of transactions.
> Following is the table schema:
>
> CREATE TABLE logs (
>     txn_id text,
>     logged_at timestamp,
>     des text,
>     key_name text,
>     params text,
>     PRIMARY KEY (txn_id, logged_at)
> )
>
> The txn_id is a 16-digit number and is not unique. It is a combination
> of 6 random digits generated using a random function, followed by the
> epoch timestamp in milliseconds (10 digits).
>
> I want to move only the data which has been generated since the previous
> run, and not the data which was already moved in the previous run. I
> tried to do it with static values, counter variables, comparing the
> write_time of each row, and ORDER BY. Still it's not working. Please
> suggest any ideas.
>
> Thanks and regards
> vinod joseph
> 8050136948
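For completeness, here is a sketch of the CL_ALL plus downgrading-retry wiring mentioned above (DataStax Java Driver 2.x; the contact point is a placeholder):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.QueryOptions;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;

    public class AllWithDowngrade {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    // Default all reads/writes to CL_ALL...
                    .withQueryOptions(new QueryOptions()
                            .setConsistencyLevel(ConsistencyLevel.ALL))
                    // ...but retry at a lower consistency level when
                    // not enough replicas are available.
                    .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
                    .build();
            Session session = cluster.connect();
            // ... run the loader queries with this session ...
            cluster.close();
        }
    }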