I see. If you are using dynamic partitions, HIVE-1307 and HIVE-1622 need to be 
in your build for merging to take place. HIVE-1307 was committed to trunk on 
08/25 and HIVE-1622 on 09/13. The simplest fix is to update your Hive trunk 
checkout and rerun the query. If it still doesn't work, please post your query 
and the output of 'explain <query>' and we can take a look.
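
For reference, a minimal script along these lines should show whether the 
merge stage makes it into the plan (target_table, source_table, col1, and ds 
are placeholders for your own names):

  -- dynamic-partition inserts need these to compile in a strict setup
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;
  -- the merge should show up as an extra (conditional) stage in the plan
  explain
  insert overwrite table target_table partition (ds)
  select col1, ds from source_table;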

Ning

On Nov 18, 2010, at 2:57 PM, Leo Alekseyev wrote:

> Hi Ning,
> For the dataset I'm experimenting with, the total size of the output
> is 2 MB, and the files are at most a few KB in size.  My
> hive.input.format was set to the default HiveInputFormat; however, when
> I set it to CombineHiveInputFormat, that only made the first stage of
> the job use fewer mappers.  The merge job was *still* filtered out at
> runtime.  I also tried setting hive.mergejob.maponly=false; that had
> no effect.
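> 
> For concreteness, this is the exact setting I used, with the full class
> name in case I fat-fingered it somewhere:
> 
>   set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;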
> 
> I'm a bit at a loss as to what to do here.  Is there a way to see
> exactly what's going on, e.g. via debug log levels?  Btw, I'm also
> using dynamic partitions; could that somehow be interfering with the
> merge job?
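> 
> The most verbose output I know how to get is via the log4j root logger
> (assuming that's even the right knob for this):
> 
>   hive -hiveconf hive.root.logger=DEBUG,console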
> 
> I'm running a relatively fresh Hive from trunk (built maybe a month ago).
> 
> --Leo
> 
> On Thu, Nov 18, 2010 at 1:12 PM, Ning Zhang <nzh...@fb.com> wrote:
>> The settings look good. The parameter hive.merge.size.smallfiles.avgsize is 
>> used to determine at run time whether a merge should be triggered: if the 
>> average size of the files in the partition is SMALLER than the parameter and 
>> there is more than one file, the merge will be scheduled. Can you check 
>> whether your resulting partition also contains some big files? If a very 
>> large file is pulling the average up, you can raise the parameter accordingly.
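>> 
>> A quick way to check from the Hive CLI (the warehouse path below is a
>> placeholder for your partition's actual location):
>> 
>>   dfs -du /user/hive/warehouse/your_table/ds=2010-11-18;
>>   -- if one big file is skewing the average upward, raise the threshold:
>>   set hive.merge.size.smallfiles.avgsize=256000000;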
>> 
>> Another possibility is that your Hadoop installation does not support 
>> CombineHiveInputFormat, which is used for the new merge job. Someone 
>> previously reported that the merge was not successful because of this. If 
>> that's the case, you can turn off CombineHiveInputFormat and fall back to the 
>> old HiveInputFormat (though slower) by setting hive.mergejob.maponly=false.
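>> 
>> That is, this single setting should be enough to switch the merge job over:
>> 
>>   set hive.mergejob.maponly=false;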
>> 
>> Ning
>> On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:
>> 
>>> I have jobs that sample (or generate) a small amount of data from a
>>> large table.  At the end I get, e.g., 3000 or more files of about
>>> 1 KB each.  This becomes a nuisance.  How can I make Hive do another
>>> pass to merge the output?  I have the following settings:
>>> 
>>> hive.merge.mapfiles=true
>>> hive.merge.mapredfiles=true
>>> hive.merge.size.per.task=256000000
>>> hive.merge.size.smallfiles.avgsize=16000000
>>> 
>>> After setting hive.merge* to true, Hive started indicating "Total
>>> MapReduce jobs = 2".  However, after generating the
>>> lots-of-small-files table, Hive says:
>>> Ended Job = job_201011021934_1344
>>> Ended Job = 781771542, job is filtered out (removed at runtime).
>>> 
>>> Is there a way to force the merge, or am I missing something?
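>>> 
>>> A crude fallback I'm considering is an extra copy pass that funnels
>>> everything through a single reducer, something like the sketch below
>>> (merged_copy and small_files_table are placeholder names; merged_copy
>>> would be a staging table with the same schema):
>>> 
>>>   -- force a reduce stage with exactly one reducer, so the staging
>>>   -- table ends up with a single output file
>>>   set mapred.reduce.tasks=1;
>>>   insert overwrite table merged_copy
>>>   select * from small_files_table
>>>   distribute by rand();
>>> 
>>> But I'd rather get the built-in merge working.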
>>> --Leo
>> 
>> 
