Split size does not have to equal block size.

http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html

An abstract InputFormat that returns CombineFileSplit's in
InputFormat.getSplits(JobConf, int) method. Splits are constructed
from the files under the input paths. A split cannot have files from
different pools. Each split returned may contain blocks from different
files. If a maxSplitSize is specified, then blocks on the same node
are combined to form a single split. Blocks that are left over are
then combined with other blocks in the same rack. If maxSplitSize is
not specified, then blocks from the same rack are combined in a single
split; no attempt is made to create node-local splits. If the
maxSplitSize is equal to the block size, then this class is similar to
the default splitting behaviour in Hadoop: each block is a locally
processed split. Subclasses implement
InputFormat.getRecordReader(InputSplit, JobConf, Reporter) to
construct RecordReader's for CombineFileSplit's.
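
To make that javadoc concrete, here is a rough sketch (mine, not from the
Hadoop docs) of what a concrete subclass can look like with the old mapred
API: it combines many small text files into fewer splits and reuses
LineRecordReader for each file chunk. The class names CombinedTextInputFormat
and TextChunkReader are made up for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  @SuppressWarnings({ "rawtypes", "unchecked" })
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // CombineFileRecordReader hands each file chunk inside the
    // CombineFileSplit to a fresh instance of the delegate reader below.
    return new CombineFileRecordReader<LongWritable, Text>(
        job, (CombineFileSplit) split, reporter, (Class) TextChunkReader.class);
  }

  // Delegate reader: exposes one chunk of the combined split as plain lines.
  public static class TextChunkReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader reader;

    // CombineFileRecordReader requires exactly this constructor signature.
    public TextChunkReader(CombineFileSplit split, Configuration conf,
        Reporter reporter, Integer idx) throws IOException {
      FileSplit fileSplit = new FileSplit(split.getPath(idx), split.getOffset(idx),
          split.getLength(idx), split.getLocations());
      reader = new LineRecordReader(conf, fileSplit);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      return reader.next(key, value);
    }
    public LongWritable createKey() { return reader.createKey(); }
    public Text createValue() { return reader.createValue(); }
    public long getPos() throws IOException { return reader.getPos(); }
    public float getProgress() throws IOException { return reader.getProgress(); }
    public void close() throws IOException { reader.close(); }
  }
}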

Hive offers a CombineHiveInputFormat

https://issues.apache.org/jira/browse/HIVE-74

Essentially, combined input formats rock hard. If you have a directory
with, say, 2000 files, you do not want 2000 splits and the overhead of
starting and stopping 2000 mappers.

If you enable CombineHiveInputFormat you can tune mapred.max.split.size,
and the number of mappers is then based (mostly) on the total input size
rather than on the file count. This gives jobs that would otherwise create
too many map tasks way more throughput, and stops them from monopolizing
the map slots on the cluster.
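
As a rough sketch (mine, not from the thread; the CombineDemo class and the
256 MB target are made up, and the property name is the one read by the
Hadoop 1.x CombineFileInputFormat and by Hive), the tuning looks roughly
like this with the old mapred API, using the CombinedTextInputFormat
sketched above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CombineDemo {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(CombineDemo.class);
    job.setJobName("combine-small-files");

    // One combined split per ~256 MB of input instead of one split per file,
    // so the mapper count tracks input size, not file count.
    job.setInputFormat(CombinedTextInputFormat.class);
    job.setLong("mapred.max.split.size", 256L * 1024 * 1024);

    FileInputFormat.setInputPaths(job, new Path(args[0]));   // e.g. the 2000-file directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    JobClient.runJob(job);   // identity mapper/reducer by default
  }
}

In Hive the equivalent is SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
together with the same mapred.max.split.size property.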

It would seem like all the extra splits from the vnode change could be
combined back together.
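
A very rough sketch of that idea (mine, not existing Cassandra or Hadoop
code): take the one-split-per-vnode list an input format produces and
bucket the splits by their first preferred replica, so that each bucket
could be handed to a single map task. The SplitCombiner class is
hypothetical.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.mapred.InputSplit;

public class SplitCombiner {
  // Group splits by their first preferred location; each group would then
  // become one map task instead of one task per vnode range.
  public static Map<String, List<InputSplit>> groupByNode(InputSplit[] splits)
      throws IOException {
    Map<String, List<InputSplit>> byNode = new HashMap<String, List<InputSplit>>();
    for (InputSplit split : splits) {
      String[] locations = split.getLocations();
      String node = locations.length > 0 ? locations[0] : "unknown";
      List<InputSplit> group = byNode.get(node);
      if (group == null) {
        group = new ArrayList<InputSplit>();
        byNode.put(node, group);
      }
      group.add(split);
    }
    return byNode;
  }
}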

On Sat, Feb 16, 2013 at 8:21 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> Wouldn't you have more than 256 splits anyway, given a normal amount of data?
>
> (Default split size is 64k rows.)
>
> On Fri, Feb 15, 2013 at 7:01 PM, Edward Capriolo <edlinuxg...@gmail.com> 
> wrote:
>> Seems like the Hadoop input format should combine the splits that are
>> on the same node into the same map task, like Hadoop's
>> CombineFileInputFormat can. I am not sure who recommends vnodes as the
>> default, because this is now the second problem of this kind (that I
>> know of) where vnodes have extra overhead:
>> https://issues.apache.org/jira/browse/CASSANDRA-5161
>>
>> This seems to be the standard operating practice in c* now: enable
>> things in the default configuration, like new partitioners and newer
>> features such as vnodes, even though they are not heavily tested in
>> the wild or well understood, and then deal with the fallout.
>>
>>
>> On Fri, Feb 15, 2013 at 11:52 AM, cem <cayiro...@gmail.com> wrote:
>>> Hi All,
>>>
>>> I have just started to use virtual nodes. I set the number of virtual
>>> nodes to 256 as recommended.
>>>
>>> The problem that I have is that when I run a mapreduce job it creates
>>> nodes * 256 splits, and therefore nodes * 256 mappers. This affects
>>> performance since the range queries have a lot of overhead.
>>>
>>> Any suggestions to improve the performance? It seems like I need to
>>> lower the number of virtual nodes.
>>>
>>> Best Regards,
>>> Cem
>>>
>>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder, http://www.datastax.com
> @spyced
