Hello, I am trying to implement an indexer for a column family in Cassandra (a cluster of 4 nodes) using Elasticsearch. I am writing a river plugin <http://www.elasticsearch.org/guide/reference/river/> which retrieves data from Cassandra and pushes it to Elasticsearch. It is triggered once a day (configurable based on requirements).
Total keys: ~50M. To read the whole column family (random partitioner), I am going with the approach mentioned here <http://wiki.apache.org/cassandra/FAQ#iter_world>, following this example: <https://github.com/zznate/hector-examples/blob/master/src/main/java/com/riptano/cassandra/hector/example/PaginateGetRangeSlices.java> (PaginateGetRangeSlices.java).

*Approach 1:*
1. Get chunks of 10,000 keys (configurable, but when I increase it beyond 15,000 I get a Thrift frame size error from Cassandra; fixing that would mean raising the frame size in cassandra.yaml), along with their columns (around 15 columns per key).
2. Send each chunk of read records to Elasticsearch.
3. It is single-threaded for now. Making it multithreaded will be hard, because I would need to track which key ranges have already been read and share the start key with every thread. Think of the PaginateGetRangeSlices.java example, but multithreaded. (A rough sketch of my pagination loop is at the end of this mail.)

I have implemented this approach, and it's not that fast: it takes about 6 hours to complete.

*Approach 2:*
1. Get all the keys using the same range query as above, but retrieve only the keys.
2. Divide the keys by x, where x is the number of threads I spawn. Each thread does an individual GET per key and inserts the result into Elasticsearch. (A sketch of this is also at the end of this mail.)

This will considerably increase the number of hits to Cassandra, but it sounds more efficient.

*So my questions are:*
1. What is the suggested strategy for reading bulk data from Cassandra? Which read pattern is better: one big get range slice of 10,000 keys with their columns, or many small GETs, one per key?
2. What about reading more values at once, say 50,000 keys with their columns, by increasing the Thrift frame size from 16 MB to something larger like 54 MB? How would that impact Cassandra's performance in general?

I would appreciate your input on any other strategies you use to move bulk data out of Cassandra.

--
Thanks,
-Utkarsh
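P.S. In case a concrete example helps, here is a minimal sketch of what my Approach 1 pagination loop looks like, modeled on PaginateGetRangeSlices.java. The cluster/keyspace/column family names and the indexIntoElasticsearch() helper are placeholders, not my actual river code:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.OrderedRows;
import me.prettyprint.hector.api.beans.Row;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.RangeSlicesQuery;

public class PaginateAllRows {
    private static final int PAGE_SIZE = 10000; // chunk size; >15,000 hits the Thrift frame limit for me

    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("MyCluster",
                new CassandraHostConfigurator("host1:9160,host2:9160"));
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);
        StringSerializer se = StringSerializer.get();

        RangeSlicesQuery<String, String, String> query =
                HFactory.createRangeSlicesQuery(keyspace, se, se, se)
                        .setColumnFamily("MyColumnFamily")
                        .setRange(null, null, false, 15) // up to 15 columns per key
                        .setRowCount(PAGE_SIZE);

        String startKey = "";
        long total = 0;
        while (true) {
            query.setKeys(startKey, "");
            OrderedRows<String, String, String> rows = query.execute().get();

            for (Row<String, String, String> row : rows) {
                // pages after the first repeat the previous page's last key, so skip it
                if (total > 0 && row.getKey().equals(startKey)) {
                    continue;
                }
                indexIntoElasticsearch(row);
                total++;
            }

            if (rows.getCount() < PAGE_SIZE) {
                break; // a short page means we have walked off the end of the ring
            }
            startKey = rows.peekLast().getKey(); // resume from the last key seen
        }
        System.out.println("Indexed " + total + " rows");
    }

    private static void indexIntoElasticsearch(Row<String, String, String> row) {
        // placeholder for the river's bulk-indexing call into Elasticsearch
    }
}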
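And here is a rough sketch of Approach 2, assuming the key list has already been collected by a keys-only pass of the same range query (I believe Hector's RangeSlicesQuery has a setReturnKeysOnly() for that, though I have not tried it at 50M keys). Again, the column family name and indexIntoElasticsearch() are placeholders:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class ParallelPerKeyReader {

    // keyspace comes from the same Hector setup as in the pagination sketch
    static void indexAllKeys(final Keyspace keyspace, List<String> allKeys, int threads)
            throws InterruptedException {
        final StringSerializer se = StringSerializer.get();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        int chunk = (allKeys.size() + threads - 1) / threads; // divide keys evenly across threads
        for (int i = 0; i < allKeys.size(); i += chunk) {
            final List<String> mine = allKeys.subList(i, Math.min(i + chunk, allKeys.size()));
            pool.submit(new Runnable() {
                public void run() {
                    for (String key : mine) {
                        // one GET per key: read up to 15 columns of this row
                        SliceQuery<String, String, String> q =
                                HFactory.createSliceQuery(keyspace, se, se, se)
                                        .setColumnFamily("MyColumnFamily")
                                        .setKey(key)
                                        .setRange(null, null, false, 15);
                        ColumnSlice<String, String> columns = q.execute().get();
                        indexIntoElasticsearch(key, columns);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS); // wait for all per-key reads to finish
    }

    private static void indexIntoElasticsearch(String key, ColumnSlice<String, String> columns) {
        // placeholder for the river's bulk-indexing call into Elasticsearch
    }
}

I am also wondering whether batching several keys per round trip with HFactory.createMultigetSliceQuery would be a reasonable middle ground between these two approaches.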