Oh, very interesting. I assumed Hadoop would be smart enough to load-balance the jobs it sends out. Guess not.
Can you submit a patch? On Wed, May 12, 2010 at 12:32 PM, Joost Ouwerkerk <jo...@openplaces.org> wrote: > I've been trying to improve the time it takes to map 30 million rows using a > hadoop / cassandra cluster with 30 nodes. I discovered that since > CassandraInputFormat returns an ordered list of splits, when there are many > splits (e.g. hundreds or more) the load on cassandra is horribly unbalanced. > e.g. if I have 30 tasks processing 600 splits, then the first 30 splits are > all located on the same one or two nodes. > I added Collections.shuffle(splits) before returning the splits in > getSplits(). As a result, the load is much better distributed, throughput > was increased (about 3X in my case) and TimedOutExceptions were all but > eliminated. > Joost. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com