I see. If you are using dynamic partitions, HIVE-1307 and HIVE-1622 need to be in your build for merging to take place. HIVE-1307 was committed to trunk on 08/25 and HIVE-1622 on 09/13. The simplest fix is to update your Hive trunk build and rerun the query. If it still doesn't work, post your query and the output of 'explain <query>' and we can take a look.
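For example, something along these lines (table and column names here are made up; substitute your own):

  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;

  explain
  insert overwrite table target_table partition (ds)
  select col1, col2, ds from source_table;

If the merge is wired in, the plan should show a conditional stage after the main map-reduce stage, with extra move/merge stages for the small-file merge.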
Ning

On Nov 18, 2010, at 2:57 PM, Leo Alekseyev wrote:

> Hi Ning,
> For the dataset I'm experimenting with, the total size of the output
> is 2 MB, and the files are at most a few KB in size. My
> hive.input.format was set to the default HiveInputFormat; however, when I
> set it to CombineHiveInputFormat, it only made the first stage of the
> job use fewer mappers. The merge job was *still* filtered out at
> runtime. I also tried setting hive.mergejob.maponly=false; that didn't
> have any effect.
>
> I am a bit at a loss as to what to do here. Is there a way to see exactly
> what's going on, e.g. using debug log levels? Btw, I'm also using
> dynamic partitions; could that somehow be interfering with the merge
> job?
>
> I'm running a relatively fresh Hive from trunk (built maybe a month ago).
>
> --Leo
>
> On Thu, Nov 18, 2010 at 1:12 PM, Ning Zhang <nzh...@fb.com> wrote:
>> The settings look good. The parameter hive.merge.size.smallfiles.avgsize is
>> used to determine at run time whether a merge should be triggered: if the
>> average size of the files in the partition is SMALLER than the parameter
>> and there is more than one file, the merge should be scheduled. Can you
>> check whether your resulting partition also contains any big files? If the
>> merge is being skipped because of a very large file, you can raise the
>> parameter enough to cover it.
>>
>> Another possibility is that your Hadoop installation does not support
>> CombineHiveInputFormat, which is used for the new merge job. Someone
>> previously reported that the merge failed because of this. If that's the
>> case, you can turn off CombineHiveInputFormat and use the old
>> HiveInputFormat (though it is slower) by setting hive.mergejob.maponly=false.
>>
>> Ning
>>
>> On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:
>>
>>> I have jobs that sample (or generate) a small amount of data from a
>>> large table. At the end, I get e.g. 3000 or more files of about 1 KB
>>> each. This becomes a nuisance. How can I make Hive do another pass
>>> to merge the output? I have the following settings:
>>>
>>> hive.merge.mapfiles=true
>>> hive.merge.mapredfiles=true
>>> hive.merge.size.per.task=256000000
>>> hive.merge.size.smallfiles.avgsize=16000000
>>>
>>> After setting hive.merge* to true, Hive started indicating "Total
>>> MapReduce jobs = 2". However, after generating the
>>> lots-of-small-files table, Hive says:
>>>
>>> Ended Job = job_201011021934_1344
>>> Ended Job = 781771542, job is filtered out (removed at runtime).
>>>
>>> Is there a way to force the merge, or am I missing something?
>>> --Leo
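For reference, a consolidated sketch of the merge-related settings discussed in this thread (the values are just the ones quoted above, not tuned recommendations):

  -- merge small output files of map-only and of map-reduce jobs
  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;
  -- target size for the merged files (~256 MB here)
  set hive.merge.size.per.task=256000000;
  -- merge runs only when the average output file size is below this (~16 MB)
  -- and the partition contains more than one file
  set hive.merge.size.smallfiles.avgsize=16000000;
  -- the merge job uses CombineHiveInputFormat; if your Hadoop version does
  -- not support it, fall back to a (slower) map-reduce merge:
  -- set hive.mergejob.maponly=false;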