You can set hive.hadoop.supports.splittable.combineinputformat=true to combine your files. In fact, this parameter should arguably be true by default, since MAPREDUCE-1597 was fixed in Hadoop 0.22.0 long ago.
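A minimal sketch of the session settings involved (this assumes Hive's default CombineHiveInputFormat; the 256 MB split size and the table name are illustrative, not from the thread):

```sql
-- Allow CombineHiveInputFormat to pack multiple non-splittable (gzip)
-- input files into a single split, so one mapper handles several files.
SET hive.hadoop.supports.splittable.combineinputformat=true;

-- Cap the combined split size in bytes; ~256 MB here (illustrative value).
SET mapreduce.input.fileinputformat.split.maxsize=268435456;

-- Run the query; the mapper count should now be well below the file count.
SELECT COUNT(*) FROM my_table;  -- my_table is a placeholder name
```

Note that each gzip file is still decompressed sequentially by one task; combining only packs several whole files into one split rather than splitting any single file.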
From: Harshit Sharan [mailto:hsincredi...@gmail.com]
Sent: Saturday, April 02, 2016 4:06 PM
To: user@hive.apache.org
Subject: Reduce number of Hadoop mappers for large number of GZ files

Hi,

I have a use case where I have 3072 gz files over which I am building a Hive table. Whenever I run a query over this table, it spawns 3072 mappers and takes around 44 minutes to complete. Earlier, the same data (i.e. an equal data size) was present in 384 files, and the same queries took only around 9 minutes.

From searching the web, I found that the number of mappers is determined by the number of "splits" of the input data. Hence, setting the parameters mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize to a high value like 64 MB should cause each mapper to take up 64 MB worth of data, even if that requires processing multiple files in the same mapper. But this solution doesn't work in my case, since GZ is a "non-splittable" format: the files cannot be split across mappers or combined for processing by a single mapper.

Has anyone faced this problem too? There are various workarounds, like uncompressing the gz files and then using the above parameters to get fewer mappers, or using higher-end EC2 instances to reduce processing time. But is there an inherent solution in Hadoop/Hive/EMR to tackle this?

Thanks in advance for any help!

--
Regards,
Harshit Sharan
Software Development Engineer