> Approach 1:
> 1. Get chunks of 10,000 keys (which is configurable, but when I increase it 
> to more than 15,000 I get a thrift frame size error from cassandra; to fix it, 
> I will need to increase that frame size via cassandra.yaml) and their columns 
> (around 15 columns/key).
> 
You can model this on the way the Hadoop ColumnFamilyRecordReader works. Run it 
in parallel on every node in the cluster, and have each process read only the 
rows that fall in the primary token range of the node it's running on. For the 
first range_slice query use the token range for the node; for subsequent 
queries, convert the last row key to a token and use that as the start token. 
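
To make that concrete, here is a rough, untested sketch of that paging pattern 
over raw Thrift (it is not the actual ColumnFamilyRecordReader code; the class 
name, page size and the MD5 token maths for the RandomPartitioner are my 
assumptions):

import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class TokenRangeScanner {

    // RandomPartitioner tokens are (roughly) abs(MD5(key)) as a BigInteger.
    static String tokenFor(byte[] key) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return new BigInteger(md5.digest(key)).abs().toString();
    }

    // Scan only the primary token range (startToken, endToken] of one node.
    public static void scan(String host, String keyspace, String cf,
                            String startToken, String endToken) throws Exception {
        TFramedTransport transport = new TFramedTransport(new TSocket(host, 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace(keyspace);

        ColumnParent parent = new ColumnParent(cf);
        SlicePredicate predicate = new SlicePredicate();
        // all columns, up to 100 per row
        predicate.setSlice_range(new SliceRange(
                ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100));

        int pageSize = 1000;                 // start small, e.g. 1K rows per slice
        String pageStart = startToken;
        while (true) {
            KeyRange range = new KeyRange();
            range.setStart_token(pageStart); // first page: the node's own start token
            range.setEnd_token(endToken);
            range.setCount(pageSize);

            List<KeySlice> rows = client.get_range_slices(
                    parent, predicate, range, ConsistencyLevel.ONE);
            if (rows.isEmpty())
                break;

            for (KeySlice row : rows) {
                // index row.getColumns() into elastic search here
            }

            // Subsequent pages: the token of the last key read becomes the start token.
            pageStart = tokenFor(rows.get(rows.size() - 1).getKey());
            if (rows.size() < pageSize)
                break;                       // range exhausted
        }
        transport.close();
    }
}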

IMHO 10K rows per slice is too many; I would start at 1K. More is not always 
better. 
 
> 1. What is the suggested strategy to read bulk data from cassandra? Which read 
> pattern is better, one big get range slice with 10,000 keys-columns or 
> multiple small GETs, one for every key?
Somewhere there is a sweet spot. Big queries hurt overall query throughput on 
the nodes and can lead to memory/GC issues on the client and servers. Lots of 
small queries result in more time spent waiting for network latency. Start 
small and find the point where the overall throughput stops improving, then 
make sure you are not hurting the throughput for other clients. 
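
A hypothetical way to find that sweet spot: time a fixed number of rows at a few 
candidate slice sizes and keep the fastest. readRows() below is just a stand-in 
for the paging code above, and the sizes are examples only:

public class PageSizeSweep {

    // Stand-in for the paging code above; wire in your reader and return rows read.
    static long readRows(int pageSize, int maxRows) {
        return maxRows;
    }

    public static void main(String[] args) {
        int[] candidateSizes = {250, 500, 1000, 2000, 5000};
        int rowsPerTrial = 50000;

        for (int pageSize : candidateSizes) {
            long start = System.nanoTime();
            long read = readRows(pageSize, rowsPerTrial);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("pageSize=%d -> %.0f rows/sec%n",
                    pageSize, read / seconds);
        }
    }
}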
 
> 2. How about reading more values at once, say 50,000 keys-columns, by 
> increasing the thrift frame size from 16MB to something larger, like 54MB? 
> How will it impact cassandra's performance in general?
It will result in increased GC pressure.
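
If you do raise it anyway, remember the client transport needs a matching limit 
as well as the server (I think the relevant cassandra.yaml settings in 1.x are 
thrift_framed_transport_size_in_mb and thrift_max_message_length_in_mb, but 
check your version). A minimal client-side sketch, with 54MB taken from your 
question rather than as a recommendation:

import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class BigFrameTransport {

    // Open a framed transport whose max frame length matches the server setting.
    public static TTransport open(String host) throws Exception {
        int maxFrameBytes = 54 * 1024 * 1024;   // must be >= the server-side frame size
        TTransport transport = new TFramedTransport(new TSocket(host, 9160), maxFrameBytes);
        transport.open();
        return transport;
    }
}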

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 28/03/2013, at 1:44 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> Hello,
> 
> I am trying to implement an indexer for a column family in cassandra (a 
> cluster of 4 nodes) using elastic search. I am writing a river plugin which 
> retrieves data from cassandra and pushes it to elastic search. It is triggered 
> once a day (configurable based on the requirement).
> 
> Total keys: ~50M
> 
> So for reading the whole column family (random partitioner), I am going ahead 
> with this approach:
> As mentioned here, I use this example (PaginateGetRangeSlices.java):
> 
> Approach 1:
> 1. Get chunks of 10,000 keys (which is configurable, but when I increase it 
> to more than 15,000 I get a thrift frame size error from cassandra; to fix it, 
> I will need to increase that frame size via cassandra.yaml) and their columns 
> (around 15 columns/key).
> 2. Then send 15,000 read records to elastic search.
> 3. It is single threaded for now. It will be hard to make this multithreaded 
> because I will need to track the range of keys which has already been read and 
> share the start key value with every thread. Think of the 
> PaginateGetRangeSlices.java example, but multi-threaded.
> 
> I have implemented this approach; it's not that fast. It takes about 6 hours 
> to complete.
> 
> Approach 2:
> 1. Get all the keys using the same query as above, but retrieve only the key.
> 2. Divide the keys by x, where x is the total number of threads I spawn. Each 
> thread will do an individual GET for a key and insert the result into elastic 
> search. This will considerably increase hits to cassandra, but sounds more 
> efficient.
> 
> 
> So my questions are:
> 1. What is the suggested strategy to read bulk data from cassandra? Which read 
> pattern is better, one big get range slice with 10,000 keys-columns or 
> multiple small GETs, one for every key?
> 
> 2. How about reading more values at once, say 50,000 keys-columns, by 
> increasing the thrift frame size from 16MB to something larger, like 54MB? 
> How will it impact cassandra's performance in general?
> 
> I will appreciate your input on any other strategies you use to move bulk 
> data out of cassandra.
> 
> -- 
> Thanks,
> -Utkarsh
