Oh, very interesting.  I assumed Hadoop would be smart enough to
load-balance the jobs it sends out.  Guess not.

Can you submit a patch?

On Wed, May 12, 2010 at 12:32 PM, Joost Ouwerkerk <jo...@openplaces.org> wrote:
> I've been trying to improve the time it takes to map 30 million rows using a
> Hadoop/Cassandra cluster with 30 nodes.  I discovered that since
> CassandraInputFormat returns an ordered list of splits, when there are many
> splits (e.g. hundreds or more) the load on Cassandra is horribly unbalanced.
> For example, if I have 30 tasks processing 600 splits, then the first 30
> splits are all located on the same one or two nodes.
> I added Collections.shuffle(splits) before returning the splits in
> getSplits().  As a result, the load is much better distributed, throughput
> increased (about 3X in my case), and TimedOutExceptions were all but
> eliminated.
> Joost.
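
For anyone following along, here is a rough, standalone sketch of the effect
Joost describes. It does not touch the real Hadoop or Cassandra classes: the
class SplitShuffleSketch, its helper methods, and the "node-N" strings are all
hypothetical stand-ins for input splits, and the 20-splits-per-node layout is
an assumption chosen to match the 600 splits / 30 nodes figure above. Only
Collections.shuffle is taken from the original message.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class SplitShuffleSketch {

    // Hypothetical stand-in for getSplits(): each entry is just the name of
    // the node hosting that split, emitted in node order.
    static List<String> orderedSplits(int nodes, int splitsPerNode) {
        List<String> splits = new ArrayList<>();
        for (int n = 0; n < nodes; n++) {
            for (int s = 0; s < splitsPerNode; s++) {
                splits.add("node-" + n);
            }
        }
        return splits;
    }

    // Counts how many distinct nodes host the first `tasks` splits, i.e. the
    // nodes that would be hit if `tasks` tasks each took one split in order.
    static int distinctNodesInFirst(List<String> splits, int tasks) {
        Set<String> nodes = new HashSet<>(splits.subList(0, tasks));
        return nodes.size();
    }

    public static void main(String[] args) {
        List<String> splits = orderedSplits(30, 20); // 600 splits, 30 nodes

        System.out.println("nodes hit by first 30 splits, ordered:  "
                + distinctNodesInFirst(splits, 30));

        // The change from the original message: shuffle before returning.
        // A fixed seed is used here only to make the sketch repeatable.
        Collections.shuffle(splits, new Random(42));

        System.out.println("nodes hit by first 30 splits, shuffled: "
                + distinctNodesInFirst(splits, 30));
    }
}
```

With the ordered list, the first 30 splits come from only two of the simulated
nodes; after the shuffle they are spread over many more. Whether that carries
over to a real cluster depends on how the scheduler hands out splits, which
this sketch does not model.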



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
