I have two tables, each with 6 million records and clustered into 10 buckets.
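For context, a bucketed table of this shape is typically declared along the following lines; this is a hypothetical sketch, since the original DDL is not shown (the column types and the bucketing column are assumptions):

```sql
-- Hypothetical DDL; column types, and any columns beyond 'calling' and
-- 'total_volume', are assumptions.
CREATE TABLE ra_md_cdr_ggsn_synthetic (
  calling      STRING,
  total_volume BIGINT
)
CLUSTERED BY (calling) INTO 10 BUCKETS;
```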
These tables are very simple, with one key column and one value column; all I want is the keys that exist in both tables but with different values. A normal join did the trick, taking only 141 seconds:

```sql
select *
from ra_md_cdr_ggsn_synthetic a
join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling)
where a.total_volume <> b.total_volume;
```

I then tried a bucket map join by setting:

```sql
set hive.optimize.bucketmapjoin = true;

select /*+ MAPJOIN(a) */ *
from ra_md_cdr_ggsn_synthetic a
join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling)
where a.total_volume <> b.total_volume;
```

The local task ran out of heap while building the hashtable:

```
2012-03-30 11:35:09 Starting to launch local task to process map join; maximum memory = 1398145024
2012-03-30 11:35:12 Processing rows: 200000  Hashtable size: 199999  Memory usage: 86646704   rate: 0.062
2012-03-30 11:35:15 Processing rows: 300000  Hashtable size: 299999  Memory usage: 128247464  rate: 0.092
2012-03-30 11:35:18 Processing rows: 400000  Hashtable size: 399999  Memory usage: 174041744  rate: 0.124
2012-03-30 11:35:21 Processing rows: 500000  Hashtable size: 499999  Memory usage: 214140840  rate: 0.153
2012-03-30 11:35:25 Processing rows: 600000  Hashtable size: 599999  Memory usage: 255181504  rate: 0.183
2012-03-30 11:35:29 Processing rows: 700000  Hashtable size: 699999  Memory usage: 296744320  rate: 0.212
2012-03-30 11:35:35 Processing rows: 800000  Hashtable size: 799999  Memory usage: 342538616  rate: 0.245
2012-03-30 11:35:38 Processing rows: 900000  Hashtable size: 899999  Memory usage: 384138552  rate: 0.275
2012-03-30 11:35:45 Processing rows: 1000000 Hashtable size: 999999  Memory usage: 425719576  rate: 0.304
2012-03-30 11:35:50 Processing rows: 1100000 Hashtable size: 1099999 Memory usage: 467319576  rate: 0.334
2012-03-30 11:35:56 Processing rows: 1200000 Hashtable size: 1199999 Memory usage: 508940504  rate: 0.364
2012-03-30 11:36:04 Processing rows: 1300000 Hashtable size: 1299999 Memory usage: 550521128  rate: 0.394
2012-03-30 11:36:09 Processing rows: 1400000 Hashtable size: 1399999 Memory usage: 592121128  rate: 0.424
2012-03-30 11:36:15 Processing rows: 1500000 Hashtable size: 1499999 Memory usage: 633720336  rate: 0.453
2012-03-30 11:36:22 Processing rows: 1600000 Hashtable size: 1599999 Memory usage: 692097568  rate: 0.495
2012-03-30 11:36:33 Processing rows: 1700000 Hashtable size: 1699999 Memory usage: 725308944  rate: 0.519
2012-03-30 11:36:40 Processing rows: 1800000 Hashtable size: 1799999 Memory usage: 766946424  rate: 0.549
2012-03-30 11:36:48 Processing rows: 1900000 Hashtable size: 1899999 Memory usage: 808527928  rate: 0.578
2012-03-30 11:36:55 Processing rows: 2000000 Hashtable size: 1999999 Memory usage: 850127928  rate: 0.608
2012-03-30 11:37:08 Processing rows: 2100000 Hashtable size: 2099999 Memory usage: 891708856  rate: 0.638
2012-03-30 11:37:16 Processing rows: 2200000 Hashtable size: 2199999 Memory usage: 933308856  rate: 0.668
2012-03-30 11:37:25 Processing rows: 2300000 Hashtable size: 2299999 Memory usage: 974908856  rate: 0.697
2012-03-30 11:37:34 Processing rows: 2400000 Hashtable size: 2399999 Memory usage: 1016529448 rate: 0.727
2012-03-30 11:37:43 Processing rows: 2500000 Hashtable size: 2499999 Memory usage: 1058129496 rate: 0.757
2012-03-30 11:37:58 Processing rows: 2600000 Hashtable size: 2599999 Memory usage: 1099708832 rate: 0.787
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
```

My cluster has 4 PCs, each with a CPU E2180, 2 GB of RAM, and an 80 GB HDD. One of them hosts the NameNode, JobTracker, and Hive Server; all of them run a DataNode and TaskTracker. On every node I set `export HADOOP_HEAPSIZE=1500` in hadoop-env.sh (about 1.3 GB of heap).

I want to ask you experts: why does the bucket map join consume so much memory? Am I doing something wrong, or is my configuration bad?

Best regards,
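As a rough sanity check on the figures in the log (a back-of-the-envelope sketch, not Hive's actual memory accounting), the reported memory usage grows roughly linearly with hashtable size, at about 420 bytes per row. At that rate a hashtable of all 6 million rows of one table would need far more than the ~1.3 GB heap:

```python
# Back-of-the-envelope estimate based on the log lines above.
# Assumption: hashtable memory grows roughly linearly with row count.

heap_bytes = 1_398_145_024    # "maximum memory" reported by the local task
mem_at_last_line = 1_099_708_832  # memory usage at 2,600,000 rows (last line before OOM)
rows_loaded = 2_600_000

bytes_per_row = mem_at_last_line / rows_loaded
print(f"~{bytes_per_row:.0f} bytes per hashtable row")        # ~423 bytes

total_rows = 6_000_000
needed = total_rows * bytes_per_row
print(f"~{needed / 2**30:.1f} GiB for all 6M rows")           # ~2.4 GiB, far over the heap
print(f"heap is only ~{heap_bytes / 2**30:.1f} GiB")          # ~1.3 GiB
```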