Seems like the Hadoop InputFormat should combine the splits that are on the same node into the same map task, the way Hadoop's CombineFileInputFormat does. I am not sure who recommends vnodes as the default, because this is now the second problem of this kind (that I know of) where vnodes add extra overhead: https://issues.apache.org/jira/browse/CASSANDRA-5161
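A minimal sketch of that combining idea, in plain Java with no Hadoop dependency: the `Split` record below is a hypothetical stand-in for an InputSplit, and the grouping would really belong inside the Cassandra InputFormat, along the lines of Hadoop's CombineFileInputFormat.

```java
import java.util.*;

// Hypothetical sketch: group many small token-range splits by the node
// that owns them, so each map task processes all co-located ranges
// instead of one range each. "Split" stands in for a Hadoop InputSplit.
public class CombineSplitsSketch {
    public record Split(String host, long startToken, long endToken) {}

    public static Map<String, List<Split>> combineByHost(List<Split> splits) {
        Map<String, List<Split>> byHost = new LinkedHashMap<>();
        for (Split s : splits) {
            byHost.computeIfAbsent(s.host(), h -> new ArrayList<>()).add(s);
        }
        return byHost; // one map task per host, not one per token range
    }

    public static void main(String[] args) {
        List<Split> splits = List.of(
            new Split("node1", 0, 100),
            new Split("node2", 100, 200),
            new Split("node1", 200, 300),
            new Split("node2", 300, 400));
        Map<String, List<Split>> tasks = combineByHost(splits);
        System.out.println(tasks.size());              // 2 tasks instead of 4
        System.out.println(tasks.get("node1").size()); // 2 ranges in one task
    }
}
```

With 256 vnodes per node, this kind of grouping would bring the mapper count back down to roughly one per node rather than 256 per node, while each task still only reads data local to it.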
This seems to be the standard operating practice in c* now: enable things in the default configuration, like new partitioners and newer features such as vnodes, even though they are not heavily tested in the wild or well understood, then deal with the fallout.

On Fri, Feb 15, 2013 at 11:52 AM, cem <cayiro...@gmail.com> wrote:
> Hi All,
>
> I have just started to use virtual nodes. I set the number of tokens to 256
> as recommended.
>
> The problem that I have is that when I run a mapreduce job it creates
> node * 256 mappers, because it creates node * 256 splits. This affects
> performance, since the range queries have a lot of overhead.
>
> Any suggestion to improve the performance? It seems like I need to lower
> the number of virtual nodes.
>
> Best Regards,
> Cem
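For what it's worth, the vnode count is set per node in cassandra.yaml; a sketch of lowering it, with 32 as an illustrative value only:

```yaml
# cassandra.yaml -- num_tokens controls how many vnodes this node owns.
# 256 is the shipped default; lowering it (32 here is just an example)
# cuts the number of Hadoop splits proportionally, at the cost of
# coarser load balancing. Note that changing it on a node that already
# has data generally means decommissioning and re-bootstrapping it.
num_tokens: 32
```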