Split size does not have to equal block size.
http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
An abstract InputFormat that returns CombineFileSplit's in the InputFormat.getSplits(JobConf, int) method. Splits are constructed from the files under the input paths. A split cannot have files from different pools. Each split returned may contain blocks from different files. If a maxSplitSize is specified, then blocks on the same node are combined to form a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If the maxSplitSize is equal to the block size, then this class is similar to the default splitting behaviour in Hadoop: each block is a locally processed split. Subclasses implement InputFormat.getRecordReader(InputSplit, JobConf, Reporter) to construct RecordReader's for CombineFileSplit's.

Hive offers a CombinedHiveInputFormat:
https://issues.apache.org/jira/browse/HIVE-74

Essentially, combined input formats rock hard. If you have a directory with, say, 2000 files, you do not want 2000 splits and the overhead of starting and stopping 2000 mappers. If you enable CombineInputFormat you can tune mapred.split.size, and the number of mappers is based (mostly) on the input size. This gives jobs that would otherwise create too many map tasks much more throughput, and stops them from monopolizing the map slots on the cluster. (A rough sketch of what such a format looks like is at the bottom of this mail.)

It would seem like all the extra splits from the vnode change could be combined back together.

On Sat, Feb 16, 2013 at 8:21 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> Wouldn't you have more than 256 splits anyway, given a normal amount of data?
>
> (Default split size is 64k rows.)
>
> On Fri, Feb 15, 2013 at 7:01 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> Seems like the Hadoop input format should combine the splits that are
>> on the same node into the same map task, like Hadoop's
>> CombinedInputFormat can. I am not sure who recommends vnodes as the
>> default, because this is now the second problem of this class (that I
>> know of) where vnodes add extra overhead:
>> https://issues.apache.org/jira/browse/CASSANDRA-5161
>>
>> This seems to be the standard operating practice in C* now: enable
>> things in the default configuration, like new partitioners and newer
>> features such as vnodes, even though they are not heavily tested in
>> the wild or well understood, and then deal with the fallout.
>>
>> On Fri, Feb 15, 2013 at 11:52 AM, cem <cayiro...@gmail.com> wrote:
>>> Hi All,
>>>
>>> I have just started to use virtual nodes. I set the number of nodes to
>>> 256, as recommended.
>>>
>>> The problem that I have is that when I run a mapreduce job it creates
>>> node * 256 mappers, because it creates node * 256 splits. This affects
>>> performance, since the range queries have a lot of overhead.
>>>
>>> Any suggestion to improve the performance? It seems like I need to
>>> lower the number of virtual nodes.
>>>
>>> Best Regards,
>>> Cem
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder, http://www.datastax.com
> @spyced
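
For anyone curious, here is roughly what the subclass the javadoc talks about looks like against the Hadoop 1.x mapred API. This is only a sketch, and the names CombinedTextInputFormat and SingleFileReader are mine, not anything shipped with Hadoop. The idea is that many files/blocks get packed into each CombineFileSplit, and each file inside the split is delegated to a plain LineRecordReader:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.LineRecordReader;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
  import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
  import org.apache.hadoop.mapred.lib.CombineFileSplit;

  // Packs many small files/blocks into each split; one mapper then
  // processes many files instead of one mapper per file.
  public class CombinedTextInputFormat
      extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    @SuppressWarnings({ "unchecked", "rawtypes" })
    public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit split, JobConf conf, Reporter reporter) throws IOException {
      // CombineFileRecordReader iterates over the files inside the combined
      // split and instantiates one SingleFileReader per file via reflection.
      return new CombineFileRecordReader(
          conf, (CombineFileSplit) split, reporter, (Class) SingleFileReader.class);
    }

    // Per-file reader: delegates the idx-th file of the combined split to a
    // plain LineRecordReader. The (split, conf, reporter, idx) constructor is
    // the signature CombineFileRecordReader expects.
    public static class SingleFileReader
        implements RecordReader<LongWritable, Text> {

      private final LineRecordReader delegate;

      public SingleFileReader(CombineFileSplit split, Configuration conf,
          Reporter reporter, Integer idx) throws IOException {
        FileSplit fileSplit = new FileSplit(
            split.getPath(idx), split.getOffset(idx),
            split.getLength(idx), split.getLocations());
        delegate = new LineRecordReader(conf, fileSplit);
      }

      public boolean next(LongWritable key, Text value) throws IOException {
        return delegate.next(key, value);
      }

      public LongWritable createKey() { return delegate.createKey(); }
      public Text createValue() { return delegate.createValue(); }
      public long getPos() throws IOException { return delegate.getPos(); }
      public float getProgress() throws IOException { return delegate.getProgress(); }
      public void close() throws IOException { delegate.close(); }
    }
  }

In the driver you point the job at the format and cap the combined split size. What I called mapred.split.size above is, as far as I know, mapred.max.split.size in the 1.x CombineFileInputFormat (double-check the property name for your version); MyJob here is just a placeholder for your job class:

  JobConf conf = new JobConf(MyJob.class);
  conf.setInputFormat(CombinedTextInputFormat.class);
  // cap each combined split at roughly 256 MB of input
  conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);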