Hi
    On first look, it seems like a plain map join is happening in your case rather than a bucketed map join. The following conditions need to hold for a bucketed map join to work (see the DDL sketch after the list for 1 and 2):
1) Both tables are bucketed on the join columns.
2) The number of buckets in one table is a multiple of the number of buckets in the other.
3) Each table has enough buckets that a single bucket fits comfortably in a mapper's heap.
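For reference, a minimal DDL sketch that satisfies conditions 1 and 2; the column types and the 32/64 bucket counts are only illustrative, not a recommendation for your data:

create table ra_md_cdr_ggsn_synthetic (
  calling      string,
  total_volume bigint
)
clustered by (calling) into 64 buckets;   -- bucketed on the join column

create table ra_ocs_cdr_ggsn_synthetic (
  calling      string,
  total_volume bigint
)
clustered by (calling) into 32 buckets;   -- 64 is a multiple of 32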

Note: If the data is large, say 1TB per table, and you have only a few buckets, say 100, each mapper may have to load about 10GB. That would definitely blow up your JVM heap. The bottom line: make sure the bucketed data distribution does not leave your mappers heavily loaded. A sketch of the relevant settings follows below.
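As a rough sketch, the settings involved are the following (the staging table name is hypothetical; hive.optimize.bucketmapjoin you have already set):

set hive.enforce.bucketing = true;        -- while loading, so each bucket is written as its own file
set hive.optimize.bucketmapjoin = true;   -- at query time, to enable the bucketed map join

insert overwrite table ra_md_cdr_ggsn_synthetic
select calling, total_volume from raw_cdr_staging;   -- raw_cdr_staging is a hypothetical source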

Regards
Bejoy.K.S


________________________________
 From: binhnt22 <binhn...@viettel.com.vn>
To: user@hive.apache.org 
Sent: Saturday, March 31, 2012 6:46 AM
Subject: Why BucketJoinMap consume too much memory
 

 
I have 2 tables, each with 6 million records, clustered into 10 buckets.

These tables are very simple, with 1 key column and 1 value column; all I want is to get the keys that exist in both tables but with different values.

The normal join did the trick, taking only 141 secs:

select * from ra_md_cdr_ggsn_synthetic a
join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling)
where a.total_volume <> b.total_volume;
 
I tried to use bucket map join by setting: set hive.optimize.bucketmapjoin = true;

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a
join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling)
where a.total_volume <> b.total_volume;
 
2012-03-30 11:35:09     Starting to launch local task to process map join;  maximum memory = 1398145024
2012-03-30 11:35:12     Processing rows:  200000   Hashtable size:  199999   Memory usage:  86646704     rate:  0.062
2012-03-30 11:35:15     Processing rows:  300000   Hashtable size:  299999   Memory usage:  128247464    rate:  0.092
2012-03-30 11:35:18     Processing rows:  400000   Hashtable size:  399999   Memory usage:  174041744    rate:  0.124
2012-03-30 11:35:21     Processing rows:  500000   Hashtable size:  499999   Memory usage:  214140840    rate:  0.153
2012-03-30 11:35:25     Processing rows:  600000   Hashtable size:  599999   Memory usage:  255181504    rate:  0.183
2012-03-30 11:35:29     Processing rows:  700000   Hashtable size:  699999   Memory usage:  296744320    rate:  0.212
2012-03-30 11:35:35     Processing rows:  800000   Hashtable size:  799999   Memory usage:  342538616    rate:  0.245
2012-03-30 11:35:38     Processing rows:  900000   Hashtable size:  899999   Memory usage:  384138552    rate:  0.275
2012-03-30 11:35:45     Processing rows:  1000000  Hashtable size:  999999   Memory usage:  425719576    rate:  0.304
2012-03-30 11:35:50     Processing rows:  1100000  Hashtable size:  1099999  Memory usage:  467319576    rate:  0.334
2012-03-30 11:35:56     Processing rows:  1200000  Hashtable size:  1199999  Memory usage:  508940504    rate:  0.364
2012-03-30 11:36:04     Processing rows:  1300000  Hashtable size:  1299999  Memory usage:  550521128    rate:  0.394
2012-03-30 11:36:09     Processing rows:  1400000  Hashtable size:  1399999  Memory usage:  592121128    rate:  0.424
2012-03-30 11:36:15     Processing rows:  1500000  Hashtable size:  1499999  Memory usage:  633720336    rate:  0.453
2012-03-30 11:36:22     Processing rows:  1600000  Hashtable size:  1599999  Memory usage:  692097568    rate:  0.495
2012-03-30 11:36:33     Processing rows:  1700000  Hashtable size:  1699999  Memory usage:  725308944    rate:  0.519
2012-03-30 11:36:40     Processing rows:  1800000  Hashtable size:  1799999  Memory usage:  766946424    rate:  0.549
2012-03-30 11:36:48     Processing rows:  1900000  Hashtable size:  1899999  Memory usage:  808527928    rate:  0.578
2012-03-30 11:36:55     Processing rows:  2000000  Hashtable size:  1999999  Memory usage:  850127928    rate:  0.608
2012-03-30 11:37:08     Processing rows:  2100000  Hashtable size:  2099999  Memory usage:  891708856    rate:  0.638
2012-03-30 11:37:16     Processing rows:  2200000  Hashtable size:  2199999  Memory usage:  933308856    rate:  0.668
2012-03-30 11:37:25     Processing rows:  2300000  Hashtable size:  2299999  Memory usage:  974908856    rate:  0.697
2012-03-30 11:37:34     Processing rows:  2400000  Hashtable size:  2399999  Memory usage:  1016529448   rate:  0.727
2012-03-30 11:37:43     Processing rows:  2500000  Hashtable size:  2499999  Memory usage:  1058129496   rate:  0.757
2012-03-30 11:37:58     Processing rows:  2600000  Hashtable size:  2599999  Memory usage:  1099708832   rate:  0.787
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
 
My system has 4 PCs, each with an E2180 CPU, 2GB RAM, and an 80GB HDD. One of them hosts the NameNode, JobTracker, and Hive Server; all of them run a DataNode and TaskTracker.

On all nodes, I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB heap).

I want to ask you experts: why does bucket map join consume so much memory? Am I doing something wrong, or is my configuration bad?
 
Best regards,
