Re: parallel processing - splitting data

siddharth verma Thu, 19 Jan 2017 04:19:01 -0800

Hi Frank,
You could try this:
https://github.com/siddv29/cfs


I have processed 1.2 billion rows in 480 seconds with just 20 threads on
the client side, using the following setup:
C* 3.0.9
Nodes = 6
RF = 3

Have a go at it. You might be surprised.
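
For reference, the general idea of scanning a table in disjoint token slices can be sketched as below. This is only an illustration, not code taken from cfs. It assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63-1), and the class name, table name (my_keyspace.my_table) and partition key (id) are hypothetical. It splits the whole ring into n non-overlapping slices rather than looking up which node owns which range, so each worker reads every row exactly once regardless of replication. With RF equal to the node count, as in the question, every node holds a full copy, so any slice can be read locally.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits the Murmur3 token ring into n disjoint slices.
class TokenRangeSplitter {
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    // Returns n inclusive [start, end] pairs that together cover the full
    // token space with no overlap. BigInteger avoids overflow, since the
    // ring holds 2^64 tokens.
    static List<long[]> split(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        BigInteger total = MAX.subtract(MIN).add(BigInteger.ONE); // 2^64
        BigInteger count = BigInteger.valueOf(n);
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            BigInteger start = MIN.add(
                    total.multiply(BigInteger.valueOf(i)).divide(count));
            BigInteger end = MIN.add(
                    total.multiply(BigInteger.valueOf(i + 1)).divide(count))
                    .subtract(BigInteger.ONE);
            ranges.add(new long[] {start.longValueExact(), end.longValueExact()});
        }
        return ranges;
    }

    // Builds the CQL text for one slice; table and key names are supplied
    // by the caller (assumed to be a single-column partition key).
    static String toCql(String table, String partitionKey, long[] range) {
        return String.format(Locale.ROOT,
                "SELECT * FROM %s WHERE token(%s) >= %d AND token(%s) <= %d",
                table, partitionKey, range[0], partitionKey, range[1]);
    }

    public static void main(String[] args) {
        // One slice per worker process, e.g. 4 workers on a 4 node cluster.
        for (long[] range : split(4)) {
            System.out.println(toCql("my_keyspace.my_table", "id", range));
        }
    }
}
```

Each worker would then run its own query through the driver and page through the results. Using more slices than workers (and handing them out from a queue) keeps individual queries small and tends to even out the load.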

Regards,


On Thu, Jan 19, 2017 at 5:35 PM, Frank Hughes <frankhughes...@gmail.com>
wrote:

> Hello there,
>
> I'm running a 4 node cluster of Cassandra 3.9 with a replication factor of
> 4.
>
> I want to be able to run a Java process on each node that selects only 25%
> of the data on that node,
> so I can process all of the data in parallel across the nodes.
>
> What is the best way to do this with the Java driver?
>
> I was assuming I could retrieve the token ranges for each node and page
> through the data using these ranges, but this includes the replicated data.
> I was hoping there was a way of only selecting the data that a node is
> responsible for and avoiding the replicated data.
>
> Many thanks for any help and guidance,
>
> Frank Hughes
>



-- 
Siddharth Verma
(Visit https://github.com/siddv29/cfs for a high speed cassandra full table
scan)
