Hello,

I am trying to implement an indexer for a column family in Cassandra
(a cluster of 4 nodes) using Elasticsearch. I am writing a river
plugin <http://www.elasticsearch.org/guide/reference/river/> that
retrieves data from Cassandra and pushes it to Elasticsearch. It is
triggered once a day (configurable based on the requirement).

Total keys: ~50M

So for reading the whole column family (RandomPartitioner), I am going
ahead with the approach mentioned here
<http://wiki.apache.org/cassandra/FAQ#iter_world>, using this example
<https://github.com/zznate/hector-examples/blob/master/src/main/java/com/riptano/cassandra/hector/example/PaginateGetRangeSlices.java>
(PaginateGetRangeSlices.java):

*Approach 1:*
1. Get chunks of 10,000 keys (configurable, but when I increase it to
more than 15,000 I get a Thrift frame size error from Cassandra; to fix
that I would need to increase the frame size via cassandra.yaml) and
their columns (around 15 columns/key).
2. Then send that chunk of read records to Elasticsearch.
3. It is single-threaded for now. It will be hard to make it
multithreaded because I would need to track which key ranges have
already been read and share a start key with every thread. Think of the
PaginateGetRangeSlices.java example, but multi-threaded.

I have implemented this approach (sketched below), and it is not that
fast: it takes about 6 hours to complete.
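
Roughly, my single-threaded loop looks like the following (a simplified
sketch assuming Hector with string keys/columns; MyCF and
indexDocument() are placeholders for my column family and the
Elasticsearch push):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.OrderedRows;
import me.prettyprint.hector.api.beans.Row;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.RangeSlicesQuery;

public class PaginatedIndexer {
    private static final int PAGE_SIZE = 10000; // configurable chunk size

    // keyspace obtained via HFactory.createKeyspace(...) on the cluster
    public static void indexAll(Keyspace keyspace) {
        StringSerializer se = StringSerializer.get();
        String startKey = "";
        while (true) {
            RangeSlicesQuery<String, String, String> query = HFactory
                .createRangeSlicesQuery(keyspace, se, se, se)
                .setColumnFamily("MyCF")
                .setKeys(startKey, "")        // start key is inclusive
                .setRange("", "", false, 100) // up to 100 columns per row
                .setRowCount(PAGE_SIZE);

            OrderedRows<String, String, String> rows = query.execute().get();
            for (Row<String, String, String> row : rows) {
                // the start key of every page after the first was already
                // processed as the last row of the previous page
                if (!startKey.isEmpty() && startKey.equals(row.getKey())) {
                    continue;
                }
                indexDocument(row); // placeholder: send the row to Elasticsearch
            }
            if (rows.getCount() < PAGE_SIZE) {
                break; // last page reached
            }
            startKey = rows.peekLast().getKey();
        }
    }

    private static void indexDocument(Row<String, String, String> row) {
        // build a document from row.getColumnSlice() and send it to
        // Elasticsearch (ideally via the bulk API)
    }
}

The start-key handling is what makes this hard to parallelize: each
page's start key depends on the previous page's result.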

*Approach 2:*
1. Get all the keys using the same range query as above, but retrieve
only the keys (Hector's setReturnKeysOnly(), I believe).
2. Divide the keys by x, where x is the number of threads I spawn. Each
thread does an individual GET for each of its keys and inserts the
result into Elasticsearch. This will considerably increase the number
of hits to Cassandra, but sounds more efficient.
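
A rough sketch of what I have in mind (again assuming Hector;
indexDocument() is the same kind of placeholder as above):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class ParallelGetIndexer {

    public static void indexAll(final Keyspace keyspace, List<String> allKeys,
                                int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // divide the keys into one contiguous slice per thread
        final int chunk = Math.max(1, (allKeys.size() + threads - 1) / threads);
        for (int i = 0; i < allKeys.size(); i += chunk) {
            final List<String> slice =
                allKeys.subList(i, Math.min(i + chunk, allKeys.size()));
            pool.submit(new Runnable() {
                public void run() {
                    StringSerializer se = StringSerializer.get();
                    for (String key : slice) {
                        // one individual GET per key
                        SliceQuery<String, String, String> query = HFactory
                            .createSliceQuery(keyspace, se, se, se)
                            .setColumnFamily("MyCF")
                            .setKey(key)
                            .setRange("", "", false, 100); // up to 100 columns
                        ColumnSlice<String, String> columns = query.execute().get();
                        indexDocument(key, columns); // placeholder: push to Elasticsearch
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS); // wait for all threads to finish
    }

    private static void indexDocument(String key, ColumnSlice<String, String> columns) {
        // build a document from the columns and send it to Elasticsearch
    }
}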


*So my questions are:*
1. What is the suggested strategy for reading bulk data from Cassandra?
Which read pattern is better: one big get_range_slices fetching 10,000
keys and their columns, or multiple small GETs, one per key?

2. How about reading more values at once, say 50,000 keys and their
columns, by increasing the Thrift frame size from 16 MB to something
larger like 54 MB? How would that impact Cassandra's performance in
general?
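
For concreteness, I believe these are the relevant settings in
cassandra.yaml (Cassandra 1.x; the values below are only illustrative):

# frame size for thrift (maximum message length)
thrift_framed_transport_size_in_mb: 54

# max length of a thrift message, including all fields and internal
# thrift overhead; must be larger than the frame size
thrift_max_message_length_in_mb: 58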

I would appreciate your input on any other strategies you use to move
bulk data out of Cassandra.

-- 
Thanks,
-Utkarsh
