On Fri, Oct 22, 2010 at 3:30 AM, Takayuki Tsunakawa <tsunakawa.ta...@jp.fujitsu.com> wrote:
> Yes, I meant one map task would be sent to each task tracker, resulting in
> 1,000 concurrent map tasks in the cluster. ColumnFamilyInputFormat cannot
> identify the nodes that actually hold some data, so the job tracker will
> send the map tasks to all of the 1,000 nodes. This is wasteful and
> time-consuming if only 200 nodes hold some data for a keyspace.

(a) Normally the data in each keyspace is spread across all of the nodes in the cluster. This is what you want for best parallelism.

(b) Cassandra generates input splits from the sample of keys that each node holds in memory. So if a node does end up with no data for a keyspace (because of bad OPP balancing, for instance), no splits will be generated for it and no map tasks will be assigned to it.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
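
To illustrate point (b), here is a minimal toy model of split generation driven by per-node key samples. It is only a sketch and not Cassandra's actual code: the function `generate_splits`, the `keys_per_split` parameter, and the node names are all hypothetical, and a split is assumed to be simply a (node, first key, last key) tuple. What it shows is that a node with an empty key sample contributes no splits, so no map task would be scheduled against it.

```python
from typing import Dict, List, Tuple

# Hypothetical split representation: (node, first sampled key, last sampled key).
Split = Tuple[str, str, str]


def generate_splits(sampled_keys_by_node: Dict[str, List[str]],
                    keys_per_split: int = 2) -> List[Split]:
    """Build splits from each node's in-memory key sample (toy model)."""
    splits: List[Split] = []
    for node, keys in sorted(sampled_keys_by_node.items()):
        ordered = sorted(keys)
        # A node with no sampled keys never enters this loop,
        # so it yields no splits at all.
        for i in range(0, len(ordered), keys_per_split):
            chunk = ordered[i:i + keys_per_split]
            splits.append((node, chunk[0], chunk[-1]))
    return splits


# Assumed example data: node-b holds nothing for this keyspace.
sampled = {
    "node-a": ["apple", "banana", "cherry"],
    "node-b": [],
    "node-c": ["kiwi"],
}
splits = generate_splits(sampled)
nodes_with_work = {node for node, _, _ in splits}
```

In this model only `node-a` and `node-c` appear in `nodes_with_work`, which mirrors the claim that an empty node gets no map tasks.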