I've been trying to improve the time it takes to map 30 million rows using a
hadoop / cassandra cluster with 30 nodes.  I discovered that since
CassandraInputFormat returns an ordered list of splits, when there are many
splits (e.g. hundreds or more) the load on cassandra is horribly unbalanced.
 e.g. if I have 30 tasks processing 600 splits, then the first 30 splits are
all located on the same one or two nodes.

I added *Collections.shuffle(splits) *before returning the splits in
getSplits().  As a result, the load is much better distributed, throughput
 was increased (about 3X in my case) and TimedOutExceptions were all but
eliminated.

Joost.

Reply via email to