Hi,
I'm finding it very difficult to understand how Hadoop and Cassandra 
(CDH3u3 and 1.0.8 respectively) split the work between mappers.


The thing that confuses me is that, for any value of cassandra.input.split.size 
I set, I always get 1 (at most 2) mappers per node.

I'm trying to debug the Cassandra code connected to a 3-node cluster, and I 
notice the following things:

** ColumnFamilyInputFormat.getRangeMap returns (correctly, I assume) 3 ranges  
[TokenRange(start_token:0, end_token:56713727820156410577229101238628035242, ….
TokenRange(start_token:56713727820156410577229101238628035242, 
end_token:113427455640312814857969558651062452224, ….
TokenRange(start_token:113427455640312814857969558651062452224, end_token:0, 
…….]

** Inside the SplitCallable object, the getSubSplits method always returns 1 
split.  
Regardless of the splitSize, the call to client.describe_splits(..) always 
returns 1 split (which is the original range).


I should also mention that the CF I'm trying to map/reduce contains around 
1500 rows, and I've tried split sizes ranging from 1000 down to 10 without any 
change, except for a "sweet spot" split size of 120 that creates exactly 2 
mappers per node. Decreasing the split size below 120, however, takes Hadoop 
back to creating 1 mapper per node.
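To make my expectation concrete, here is the naive arithmetic I had in mind (this is just my mental model of what describe_splits should do, not what the code actually does; the expected_subsplits helper is mine):

```python
import math

# Naive model (my assumption): each of the 3 token ranges covers roughly
# rows_total / num_nodes rows, and describe_splits should cut each range
# into about rows_in_range / keys_per_split sub-splits.
def expected_subsplits(rows_total, num_nodes, keys_per_split):
    rows_per_range = rows_total / num_nodes  # ~500 rows per range in my case
    return max(1, math.ceil(rows_per_range / keys_per_split))

# With my ~1500-row CF on a 3-node cluster:
for size in (1000, 120, 10):
    print(size, expected_subsplits(1500, 3, size))
# → 1000 -> 1, 120 -> 5, 10 -> 50
```

So under this model a split size of 10 should give ~50 mappers per node, yet describe_splits keeps returning 1 (or 2 at the 120 "sweet spot") — which is why I suspect either my model or the code is wrong.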

It seems to me that, with my current Cassandra configuration, the 
describe_splits RPC call always returns 1 or 2 splits, regardless of the 
keys_per_split value passed.

Is it maybe a Cassandra configuration issue? Or could it be a bug in the code?

Thanks,
--  
Filippo Diotalevi
