Hi
   I have to fetch around 1 million rows based on a secondary index, update
them, and push them back into Cassandra based on the partition key. Please
advise the best approach. I have to use the DataStax Java driver.
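Something along these lines is what I have in mind so far (only a rough sketch:
the keyspace, table and column names such as my_ks, records, record_id and
status are placeholders, it assumes a secondary index already exists on status,
and it relies on the driver's automatic result paging):

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class BulkUpdateByIndex {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");

        // Fetch via the secondary index on "status"; a small fetch size lets the
        // driver page through the ~1 million rows instead of loading them at once.
        Statement select = new SimpleStatement(
                "SELECT record_id FROM records WHERE status = 'ACTIVE'");
        select.setFetchSize(1000);

        // Write back keyed by the partition key (record_id).
        PreparedStatement update = session.prepare(
                "UPDATE records SET status = ? WHERE record_id = ?");

        for (Row row : session.execute(select)) {
            BoundStatement bound = update.bind("ARCHIVED", row.getString("record_id"));
            session.execute(bound); // executeAsync with some throttling would be faster
        }

        cluster.close();
    }
}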

Regards 
Anil 



Sent from Samsung Mobile

-------- Original message --------
From: "Peer, Oded" <oded.p...@rsa.com>
Date: 04/26/2015 4:29 PM (GMT+08:00)
To: user@cassandra.apache.org
Cc:
Subject: RE: Data model suggestions

I would maintain two tables.
An “archive” table that holds all the active and inactive records and is
updated hourly (re-inserting the same record has some compaction overhead, but
on the other hand deleting records has tombstone overhead).
An “active” table that holds all the records from the last external API
invocation.
To avoid tombstones and read-before-delete issues, “active” should actually be
a synonym, an alias, for the most recent active table.
I suggest you create two identical tables, “active1” and “active2”, and an 
“active_alias” table that informs which of the two is the most recent.
Thus when you query the external API you insert the data into “archive” and
into the unaliased “activeN” table, switch the alias value in “active_alias”,
and truncate the newly unaliased “activeM” table.
No need to query the data before inserting it. Make sure truncating doesn’t
create automatic snapshots.
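A rough sketch of what I mean, using the Java driver (the table and column
names are only placeholders, not a definitive implementation):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ActiveAliasSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");

        // Two identical "active" tables plus a one-row table naming the alias.
        session.execute("CREATE TABLE IF NOT EXISTS active1 (record_id text PRIMARY KEY, payload text)");
        session.execute("CREATE TABLE IF NOT EXISTS active2 (record_id text PRIMARY KEY, payload text)");
        session.execute("CREATE TABLE IF NOT EXISTS active_alias (name text PRIMARY KEY, aliased text)");
        session.execute("INSERT INTO active_alias (name, aliased) VALUES ('active', 'active1') IF NOT EXISTS");

        // On each external API invocation:
        Row row = session.execute("SELECT aliased FROM active_alias WHERE name = 'active'").one();
        String aliased = row.getString("aliased");
        String unaliased = aliased.equals("active1") ? "active2" : "active1";

        // 1. Insert the fresh records into "archive" and into the unaliased table here.
        // 2. Switch the alias so readers move to the freshly loaded table.
        session.execute("UPDATE active_alias SET aliased = '" + unaliased + "' WHERE name = 'active'");
        // 3. Truncate the table that just became unaliased (the previously aliased one).
        session.execute("TRUNCATE " + aliased);

        cluster.close();
    }
}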
 
 
From: Narendra Sharma [mailto:narendra.sha...@gmail.com] 
Sent: Friday, April 24, 2015 6:53 AM
To: user@cassandra.apache.org
Subject: Re: Data model suggestions
 
I think one table, say record, should be good. The primary key is the record
id. This will ensure good distribution.
Just update the active attribute to true or false.
For range queries on active vs. archived records maintain 2 indexes or try a
secondary index.
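Roughly what I mean (a sketch only; my_ks, record, record_id and the other
names are made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SingleRecordTable {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");

        // One table keyed by record id; the id hashes rows evenly across nodes.
        session.execute("CREATE TABLE IF NOT EXISTS record ("
                + " record_id text PRIMARY KEY,"
                + " active boolean,"
                + " payload text)");

        // Optional secondary index to pull active (or archived) records.
        session.execute("CREATE INDEX IF NOT EXISTS record_active_idx ON record (active)");

        // Archiving is just flipping the attribute.
        session.execute("UPDATE record SET active = false WHERE record_id = 'some-id'");

        cluster.close();
    }
}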

On Apr 23, 2015 1:32 PM, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
Good point about the range selects. I think they can be made to work with
limits, though. Or, since the active records will usually never be > 500k, the
ids may just be cached in memory.
 
Most of the time, during reads, the queries will just consist of select * where
primaryKey = someValue. One row at a time.
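i.e. essentially just a single-partition lookup like this (a sketch, with
made-up names):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class PointLookup {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");

        // Single-partition read: one row per query, keyed by the primary key.
        PreparedStatement ps = session.prepare("SELECT * FROM record WHERE record_id = ?");
        Row row = session.execute(ps.bind("some-id")).one();
        if (row != null) {
            System.out.println(row.getString("record_id"));
        }
        cluster.close();
    }
}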
 
The question is just whether to keep all records in one table (including
archived records which won't be queried 99% of the time), or to keep active
records in their own table and delete them when they're no longer active. Will
that produce tombstone issues?
 
On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <khangaon...@gmail.com> 
wrote:
Hi,

If your external API returns active records, then I am guessing you need to do
a select * on the active table to figure out which records in the table are no
longer active.

You might be aware that range selects based on the partition key will time out
in Cassandra. They can however be made to work using the clustering key.
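For example something along these lines (a purely illustrative schema,
bucketing by day just to show the idea):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ClusteringRangeQuery {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");

        // Bucket rows by day (partition key) and order them by a timestamp
        // clustering column, so ranges stay inside a single partition.
        session.execute("CREATE TABLE IF NOT EXISTS records_by_day ("
                + " day text, created timestamp, record_id text,"
                + " PRIMARY KEY (day, created, record_id))");

        // Range on the clustering column within one partition: no full scan.
        ResultSet rs = session.execute(
                "SELECT record_id FROM records_by_day"
                + " WHERE day = '2015-04-23' AND created >= '2015-04-23 00:00:00+0000'");
        for (Row row : rs) {
            System.out.println(row.getString("record_id"));
        }
        cluster.close();
    }
}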

To comment more, we would need to see your proposed Cassandra tables and the
queries that you might need to run.

regards
 
 

 
On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
That's returned by the external API we're querying. We query them for active
records; if a previously active record isn't included in the results, that
means it's time to archive that record.
 
On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <khangaon...@gmail.com> wrote:
Hi,

How do you determine if the record is no longer active? Is it a periodic
process that goes through every record and checks when it was last updated?

 
