Hive merge map reduce files - need help in understanding the parameters and full flow

Bejoy Ks Thu, 03 Nov 2011 04:20:33 -0700

Hi Experts,
       I'm stuck on a problem with merging the smaller output files produced 
as part of Hive jobs. To test merging, I set the following parameters:


set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=8000000;
set hive.merge.smallfiles.avgsize=2000000;

My understanding is that 

        * every task would give me an output file of at least 8 MB, and
        * if the average size of the final output files is less than 
'hive.merge.smallfiles.avgsize' (here 2 MB), then a merge job (a map-only 
job) would be run. By average file size, I'm assuming it is calculated 
as the sum of all the file sizes divided by the number of files.

But the file sizes in the output dir don't match my understanding. 
There are files with the following sizes:
Found 6 items
17946 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000000_0
15951584 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000001_0
131776 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000002_0
7194653 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000003_0
6434 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000005_0
12697784 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000007_0
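
As a quick check of the averaging assumption stated earlier, here is a small 
Python sketch (not Hive code; the function names are made up for illustration) 
that takes the six sizes from the listing and computes the average as the sum 
of the sizes divided by the number of files. It assumes the merge check works 
the way I described and that these are the files it looks at, which may not 
be how Hive actually does it.

```python
# File sizes in bytes, copied from the directory listing.
sizes = [17946, 15951584, 131776, 7194653, 6434, 12697784]

# Value given to hive.merge.smallfiles.avgsize in this test.
avg_threshold = 2_000_000


def average_size(file_sizes):
    """Sum of all file sizes divided by the number of files."""
    return sum(file_sizes) / len(file_sizes)


def merge_expected(file_sizes, threshold):
    """Assumed trigger: a merge is expected only if the average is below the threshold."""
    return average_size(file_sizes) < threshold


print(sum(sizes))                            # 36000177
print(average_size(sizes))                   # 6000029.5
print(merge_expected(sizes, avg_threshold))  # False
```

Under that assumption the average is about 6 MB, which is above the 2 MB 
threshold, so a merge would not be expected for this directory even though 
three of the six files are far smaller than 2 MB.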

The file sizes vary from about 15 MB down to about 6 KB. Could someone help me 
understand the merge logic and why I'm getting such varying file sizes?

What I was aiming for with this merge test is that, in my output table's sub 
directories (my output table has multiple levels of partitions), I want to have 
files whose sizes are always greater than 8 MB. 
(I'm testing with 8 MB for now, but in production I need to change this 
value to 128 MB.)
Also, it would be better if the files were of nearly equal size.
Am I headed in the right direction to achieve this goal?
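
To make the goal concrete, here is a rough Python sketch of the arithmetic 
(plain Python, not anything Hive does; the helper names are hypothetical). It 
takes the total bytes in the partition directory listed earlier and works out 
how many files of at least 8 MB could exist and what nearly equal sizes would 
look like.

```python
# File sizes in bytes for the partition directory listed earlier.
current_sizes = [17946, 15951584, 131776, 7194653, 6434, 12697784]
min_file_size = 8_000_000  # desired minimum size per output file


def max_file_count(total_bytes, minimum):
    """Largest number of files such that each can be at least `minimum` bytes.

    Falls back to a single file when the total is smaller than the minimum.
    """
    return max(1, total_bytes // minimum)


def equal_split(total_bytes, count):
    """Split total_bytes into `count` sizes that differ by at most one byte."""
    base, extra = divmod(total_bytes, count)
    return [base + 1] * extra + [base] * (count - extra)


total = sum(current_sizes)
count = max_file_count(total, min_file_size)
target_sizes = equal_split(total, count)

print(count)         # 4
print(target_sizes)  # [9000045, 9000044, 9000044, 9000044]
```

So for this directory the result I am hoping for is at most 4 files of 
roughly 9 MB each, instead of the 6 files of very different sizes I get now.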

I tried setting a few other parameters along with the previous ones, such as:
-hiveconf mapred.min.split.size.per.node=8000000 
-hiveconf mapred.min.split.size.per.rack=8000000 
-hiveconf mapred.max.split.size=8000000 
(Going by a recent JIRA for Hive 0.8, I believe we no longer need to set these 
explicitly: https://issues.apache.org/jira/browse/HIVE-2037)
But the result is still the same: varying file sizes like those above.

I'm on Hive 0.7 in a CDH3u0 environment (hive-hwi-0.7.0-cdh3u0.war).

It would be great if someone could help me understand how merging of smaller 
files works in Hive map reduce tasks and guide me in achieving the desired 
results. 

Thank you

Regards,
Bejoy.K.S
