I have jobs that sample (or generate) a small amount of data from a large table. At the end, I get roughly 3000 or more output files of about 1 KB each, which becomes a nuisance. How can I make Hive do another pass to merge the output? I have the following settings:
hive.merge.mapfiles=true
hive.merge.mapredfiles=true
hive.merge.size.per.task=256000000
hive.merge.size.smallfiles.avgsize=16000000

After setting the hive.merge.* options to true, Hive started reporting "Total MapReduce jobs = 2". However, after generating the lots-of-small-files table, Hive says:

Ended Job = job_201011021934_1344
Ended Job = 781771542, job is filtered out (removed at runtime).

Is there a way to force the merge, or am I missing something? --Leo
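For context, a minimal sketch of one common workaround: rewrite the table with an extra INSERT OVERWRITE pass so that the merge settings apply to the rewriting job. Note that in the Hive documentation the small-file threshold property is spelled hive.merge.smallfiles.avgsize (without the extra ".size."), which may be worth double-checking against the settings above. The table name small_files_table below is hypothetical.

```sql
-- Merge settings, using the property names as documented in Hive.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;

-- Rewrite the table in place; the conditional merge stage can then
-- consolidate the small output files of this job.
-- small_files_table is a placeholder for the actual table.
INSERT OVERWRITE TABLE small_files_table
SELECT * FROM small_files_table;
```

This is a sketch under the assumption that the merge stage triggers when the average output file size falls below hive.merge.smallfiles.avgsize; it is not guaranteed to fire if Hive filters the conditional merge job out at runtime, as in the log above.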