You can set hive.hadoop.supports.splittable.combineinputformat=true to combine your files. In fact, this parameter should arguably be true by default, since MAPREDUCE-1597 was fixed in Hadoop 0.22.0 long ago.
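A minimal sketch of the session settings involved (this assumes Hive's default CombineHiveInputFormat; the 256 MB split size and the table name are illustrative, not from the thread):

```sql
-- Allow CombineHiveInputFormat to pack multiple non-splittable (gzip)
-- input files into a single split, so one mapper handles several files.
SET hive.hadoop.supports.splittable.combineinputformat=true;

-- Cap the combined split size in bytes; ~256 MB here (illustrative value).
SET mapreduce.input.fileinputformat.split.maxsize=268435456;

-- Run the query; the mapper count should now be well below the file count.
SELECT COUNT(*) FROM my_table;  -- my_table is a placeholder name
```

Note that each gzip file is still decompressed sequentially by one task; combining only packs several whole files into one split rather than splitting any single file.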
From: Harshit Sharan [mailto:hsincredi...@gmail.com]
Sent: Saturday, April 02, 2016 4:06 PM
To: user@hive.apache.org
Subject: Reduce number of Hadoop mappers for large number of GZ files

Hi,

I have a use case where I have 3072 gz files over which I am building a Hive table. Whenever I run a query over this table, it spawns 3072 mappers and takes around 44 minutes to complete. Earlier, the same data (i.e. an equal data size) was present in 384 files, and the same queries took only around 9 minutes.

From searching the web, I found that the number of mappers is determined by the number of "splits" of the input data. Hence, setting the parameters mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize to a high value like 64 MB should cause each mapper to take up 64 MB worth of data, even if that requires processing multiple files in the same mapper. But this solution doesn't work in my case, since GZ is a "non-splittable" format: the files cannot be split across mappers or combined for processing by a single mapper.

Has anyone faced this problem too? There are various workarounds, like uncompressing the gz files and then using the above parameters to get fewer mappers, or using higher-end EC2 instances to reduce processing time. But is there an inherent solution in Hadoop/Hive/EMR to tackle this?

Thanks in advance for any help!

--
Regards,
Harshit Sharan
Software Development Engineer