I have 2 tables; each has 6 million records and is clustered into 10 buckets.

 

These tables are very simple, with 1 key column and 1 value column. All I
want is to get the keys that exist in both tables but with different values.
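
For reference, the DDL looks roughly like the following (the column types
shown here are only an assumption to illustrate the layout; the second table
is created the same way):

create table ra_md_cdr_ggsn_synthetic (
  calling       string,
  total_volume  bigint
)
clustered by (calling) into 10 buckets;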

 

A normal join did the trick, taking only 141 secs:

 

select *
from ra_md_cdr_ggsn_synthetic a
join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling)
where a.total_volume <> b.total_volume;

 

Then I tried to use a bucket map join by setting: set hive.optimize.bucketmapjoin = true
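
For completeness, the bucketing itself is normally enforced when the tables
are loaded; roughly these are the relevant settings (only the second one was
changed for this query, and I am not certain the first one is the missing
piece):

set hive.enforce.bucketing = true;       -- normally set when inserting into the bucketed tables
set hive.optimize.bucketmapjoin = true;  -- enables the bucket map join for the query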

 

select /*+ MAPJOIN(a) */ *
from ra_md_cdr_ggsn_synthetic a
join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling)
where a.total_volume <> b.total_volume;

 

2012-03-30 11:35:09     Starting to launch local task to process map join;  maximum memory = 1398145024
2012-03-30 11:35:12     Processing rows: 200000   Hashtable size: 199999    Memory usage: 86646704      rate: 0.062
2012-03-30 11:35:15     Processing rows: 300000   Hashtable size: 299999    Memory usage: 128247464     rate: 0.092
2012-03-30 11:35:18     Processing rows: 400000   Hashtable size: 399999    Memory usage: 174041744     rate: 0.124
2012-03-30 11:35:21     Processing rows: 500000   Hashtable size: 499999    Memory usage: 214140840     rate: 0.153
2012-03-30 11:35:25     Processing rows: 600000   Hashtable size: 599999    Memory usage: 255181504     rate: 0.183
2012-03-30 11:35:29     Processing rows: 700000   Hashtable size: 699999    Memory usage: 296744320     rate: 0.212
2012-03-30 11:35:35     Processing rows: 800000   Hashtable size: 799999    Memory usage: 342538616     rate: 0.245
2012-03-30 11:35:38     Processing rows: 900000   Hashtable size: 899999    Memory usage: 384138552     rate: 0.275
2012-03-30 11:35:45     Processing rows: 1000000  Hashtable size: 999999    Memory usage: 425719576     rate: 0.304
2012-03-30 11:35:50     Processing rows: 1100000  Hashtable size: 1099999   Memory usage: 467319576     rate: 0.334
2012-03-30 11:35:56     Processing rows: 1200000  Hashtable size: 1199999   Memory usage: 508940504     rate: 0.364
2012-03-30 11:36:04     Processing rows: 1300000  Hashtable size: 1299999   Memory usage: 550521128     rate: 0.394
2012-03-30 11:36:09     Processing rows: 1400000  Hashtable size: 1399999   Memory usage: 592121128     rate: 0.424
2012-03-30 11:36:15     Processing rows: 1500000  Hashtable size: 1499999   Memory usage: 633720336     rate: 0.453
2012-03-30 11:36:22     Processing rows: 1600000  Hashtable size: 1599999   Memory usage: 692097568     rate: 0.495
2012-03-30 11:36:33     Processing rows: 1700000  Hashtable size: 1699999   Memory usage: 725308944     rate: 0.519
2012-03-30 11:36:40     Processing rows: 1800000  Hashtable size: 1799999   Memory usage: 766946424     rate: 0.549
2012-03-30 11:36:48     Processing rows: 1900000  Hashtable size: 1899999   Memory usage: 808527928     rate: 0.578
2012-03-30 11:36:55     Processing rows: 2000000  Hashtable size: 1999999   Memory usage: 850127928     rate: 0.608
2012-03-30 11:37:08     Processing rows: 2100000  Hashtable size: 2099999   Memory usage: 891708856     rate: 0.638
2012-03-30 11:37:16     Processing rows: 2200000  Hashtable size: 2199999   Memory usage: 933308856     rate: 0.668
2012-03-30 11:37:25     Processing rows: 2300000  Hashtable size: 2299999   Memory usage: 974908856     rate: 0.697
2012-03-30 11:37:34     Processing rows: 2400000  Hashtable size: 2399999   Memory usage: 1016529448    rate: 0.727
2012-03-30 11:37:43     Processing rows: 2500000  Hashtable size: 2499999   Memory usage: 1058129496    rate: 0.757
2012-03-30 11:37:58     Processing rows: 2600000  Hashtable size: 2599999   Memory usage: 1099708832    rate: 0.787
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
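
Some rough arithmetic from the log above (my own estimate, not a measurement):

  1,099,708,832 bytes / 2,600,000 rows   ~ 423 bytes per hashtable entry
  6,000,000 rows * ~423 bytes            ~ 2.5 GB for the whole table
  maximum memory reported                = 1,398,145,024 bytes ~ 1.3 GB

so the local task runs out of heap long before the full table fits in the hashtable.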

 

My system has 4 PCs, each with a CPU E2180, 2 GB RAM, and an 80 GB HDD. One of
them hosts the NameNode, JobTracker, and Hive Server, and all of them run a
DataNode and TaskTracker.

 

On all nodes, I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3 GB
heap).

 

I want to ask you experts: why does the bucket map join consume so much
memory? Am I doing something wrong, or is my configuration bad?

 

Best regards,

 
