The settings look good. The parameter hive.merge.size.smallfiles.avgsize is used at run time to decide whether a merge should be triggered: if the average size of the files in the partition is SMALLER than the parameter and there is more than one file, the merge is scheduled. Can you check whether your resulting partition also contains any big files? If a very large file is pulling the average above the threshold, you can set the parameter to a value large enough to exceed that average.
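To make the rule concrete, here is a rough Python sketch of the check as described in this message. It is not Hive's actual code: the function name should_merge and the example file sizes are made up for illustration, and only the 16000000 threshold comes from the settings quoted in this thread.

```python
def should_merge(file_sizes, avg_size_threshold):
    """Illustrative only: mirrors the rule described in this message,
    not Hive's real implementation.

    file_sizes: sizes in bytes of the files in one output partition.
    avg_size_threshold: value of hive.merge.size.smallfiles.avgsize.
    """
    # A single file (or none) never needs merging.
    if len(file_sizes) <= 1:
        return False
    average = sum(file_sizes) / len(file_sizes)
    # Merge only when the average file size is below the threshold.
    return average < avg_size_threshold


# Hypothetical partition: 3000 files of about 1 KB each.
small_files = [1000] * 3000
print(should_merge(small_files, 16_000_000))  # True

# Same partition plus one hypothetical 60 GB file: the average rises
# above the threshold, so no merge would be scheduled.
skewed = small_files + [60_000_000_000]
print(should_merge(skewed, 16_000_000))  # False
```

The second case shows why one very large file can stop the merge even when almost all files are tiny.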
Another possibility is that your Hadoop installation does not support CombineHiveInputFormat, which is used for the new merge job. Someone previously reported that the merge was not successful because of this. If that's the case, you can turn off CombineHiveInputFormat and use the old HiveInputFormat (which is slower) by setting hive.mergejob.maponly=false.

Ning

On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:

> I have jobs that sample (or generate) a small amount of data from a
> large table. At the end, I get e.g. about 3000 or more files of 1kb
> or so. This becomes a nuisance. How can I make Hive do another pass
> to merge the output? I have the following settings:
>
> hive.merge.mapfiles=true
> hive.merge.mapredfiles=true
> hive.merge.size.per.task=256000000
> hive.merge.size.smallfiles.avgsize=16000000
>
> After setting hive.merge* to true, Hive started indicating "Total
> MapReduce jobs = 2". However, after generating the
> lots-of-small-files table, Hive says:
> Ended Job = job_201011021934_1344
> Ended Job = 781771542, job is filtered out (removed at runtime).
>
> Is there a way to force the merge, or am I missing something?
> --Leo