Re: Hive produces very small files despite hive.merge...=true settings

Ning Zhang Thu, 18 Nov 2010 13:13:24 -0800

The settings look good. The parameter hive.merge.size.smallfiles.avgsize is 
used to determine at run time whether a merge should be triggered: if the 
average size of the files in the partition is SMALLER than the parameter and 
there is more than one file, the merge is scheduled. Can you check whether the 
resulting partition also contains any big files? A single very large file can 
pull the average above the threshold; if that is the cause, set the parameter 
to a value larger than that average.
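
To make the run-time check concrete, here is a minimal sketch in Python. It is 
not Hive's actual code: the function name should_merge and the example file 
sizes are made up for illustration, and it assumes all sizes are in bytes.

```python
def should_merge(file_sizes, avg_size_threshold):
    """Illustrative version of the check described above (not Hive's code).

    file_sizes: sizes in bytes of the files in one output partition.
    avg_size_threshold: the configured average-size limit, in bytes.
    """
    # A single file (or none) is never merged.
    if len(file_sizes) <= 1:
        return False
    average = sum(file_sizes) / len(file_sizes)
    # Merge only when the average file size is below the threshold.
    return average < avg_size_threshold


# Hypothetical partition: 3000 files of about 1 KB each.
small_only = [1024] * 3000
# The same partition plus one hypothetical 60 GB file.
with_big_file = small_only + [60_000_000_000]

print(should_merge(small_only, 16_000_000))
print(should_merge(with_big_file, 16_000_000))
```

With only small files the average is far below 16000000, so the check passes. 
Adding the one very large file raises the average above 16000000, and the same 
check then fails even though almost all of the files are tiny.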

Another possibility is that your Hadoop installation does not support 
CombineHiveInputFormat, which is used for the new merge job. Someone previously 
reported that the merge did not succeed because of this. If that's the case, 
you can turn off CombineHiveInputFormat and use the old HiveInputFormat (though 
it is slower) by setting hive.mergejob.maponly=false.

Ning

On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:

> I have jobs that sample (or generate) a small amount of data from a
> large table.  At the end, I get e.g. about 3000 or more files of 1kb
> or so.  This becomes a nuisance.  How can I make Hive do another pass
> to merge the output?  I have the following settings:
> 
> hive.merge.mapfiles=true
> hive.merge.mapredfiles=true
> hive.merge.size.per.task=256000000
> hive.merge.size.smallfiles.avgsize=16000000
> 
> After setting hive.merge* to true, Hive started indicating "Total
> MapReduce jobs = 2".  However, after generating the
> lots-of-small-files table, Hive says:
> Ended Job = job_201011021934_1344
> Ended Job = 781771542, job is filtered out (removed at runtime).
> 
> Is there a way to force the merge, or am I missing something?
> --Leo
