Hi Experts,

I'm stuck on a problem with merging the small output files produced by Hive jobs. To test merging, I set the following parameters:
    set hive.merge.mapredfiles=true;
    set hive.merge.size.per.task=8000000;
    set hive.merge.smallfiles.avgsize=2000000;

My understanding is that:

* every task would give me an output file of at least 8 MB, and
* if the average size of the final output files is less than 'hive.merge.smallfiles.avgsize' (here 2 MB), then a merge job (a map-only job) would be run.

By average file size, I'm assuming it is calculated as the sum of all the file sizes divided by the number of files.

But the file sizes in the output directory don't match this understanding. These are the files and their sizes:

    Found 6 items
    17946     2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000000_0
    15951584  2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000001_0
    131776    2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000002_0
    7194653   2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000003_0
    6434      2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000005_0
    12697784  2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000007_0

The file sizes vary from roughly 15 MB down to about 6 KB. Could someone help me understand the merge logic and why I'm getting such varying file sizes?

What I was aiming for with this merge test is that, in each sub-directory of my output table (the table has multiple levels of partitions), every file is larger than 8 MB. (I'm testing with 8 MB for now, but in production I need to change this value to 128 MB.) It would also be better if the files were of nearly equal size. Am I heading in the right direction to achieve this goal?
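To make the averaging assumption concrete, here is a minimal Python sketch that applies it to the six file sizes listed above. It only models my assumed rule (total bytes divided by number of files, compared against 'hive.merge.smallfiles.avgsize'); the function names are made up for illustration and this is not taken from the Hive source.

```python
# Sizes in bytes, copied from the directory listing in this post.
file_sizes = [17946, 15951584, 131776, 7194653, 6434, 12697784]

# Value of hive.merge.smallfiles.avgsize used in the test (2 MB).
SMALLFILES_AVGSIZE = 2_000_000


def average_file_size(sizes):
    """Total bytes divided by number of files (assumed definition of 'average')."""
    return sum(sizes) / len(sizes)


def merge_expected(sizes, threshold):
    """True if the average is below the threshold (assumed trigger rule, not Hive code)."""
    return average_file_size(sizes) < threshold


avg = average_file_size(file_sizes)
print(f"total bytes : {sum(file_sizes)}")
print(f"average size: {avg:.1f}")
print(f"merge expected: {merge_expected(file_sizes, SMALLFILES_AVGSIZE)}")
```

If this reading is right, the average comes to about 6 MB, which is already above the 2 MB threshold, so no merge job would be launched for this partition even though several files are only a few KB. I'm not sure whether Hive actually computes the average this way, or whether it does so per partition or per job.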
I also tried setting a few other parameters along with the previous ones:

    -hiveconf mapred.min.split.size.per.node=8000000
    -hiveconf mapred.min.split.size.per.rack=8000000
    -hiveconf mapred.max.split.size=8000000

(Going by a recent JIRA for Hive 0.8, I believe we won't need to set these explicitly there: https://issues.apache.org/jira/browse/HIVE-2037)

But it still returns the same result, with varying file sizes like those above. I'm on Hive 0.7 in a CDH3u0 environment (hive-hwi-0.7.0-cdh3u0.war).

It would be great if someone could help me understand how merging of small files works in Hive map reduce tasks and guide me in achieving the desired result.

Thank you

Regards,
Bejoy.K.S