I've been trying to improve the time it takes to map 30 million rows using a hadoop / cassandra cluster with 30 nodes. I discovered that since CassandraInputFormat returns an ordered list of splits, when there are many splits (e.g. hundreds or more) the load on cassandra is horribly unbalanced. e.g. if I have 30 tasks processing 600 splits, then the first 30 splits are all located on the same one or two nodes.
I added *Collections.shuffle(splits) *before returning the splits in getSplits(). As a result, the load is much better distributed, throughput was increased (about 3X in my case) and TimedOutExceptions were all but eliminated. Joost.