Can you tell me the size of table b? If you are doing bucketing and table b is still huge, then it will run into this problem.
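A quick way to check is from the Hive CLI itself: look at the declared bucketing and the per-bucket file sizes on HDFS. This is just a generic sketch; the warehouse path below is only the usual default and may differ on your setup:

    hive> describe formatted ra_ocs_syn;
    hive> dfs -du /user/hive/warehouse/ra_ocs_syn;
    hive> dfs -dus /user/hive/warehouse/ra_ocs_syn;

describe formatted reports Num Buckets, Bucket Columns and Location; dfs -du lists the size of each bucket file (which is the unit a bucket map join actually has to hold in memory), and dfs -dus gives the total table size.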
On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <binhn...@viettel.com.vn> wrote:

Thank Nitin,

I tried but no luck. Here is the hive log; please spend a little time to look at it.

hive> set hive.optimize.bucketmapjoin = true;
hive> set hive.enforce.bucketing=true;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
    > on (a.calling = b.calling) where a.total_volume <> b.total_volume;
Total MapReduce jobs = 1
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log
2012-04-05 10:37:45  Starting to launch local task to process map join; maximum memory = 1398145024
2012-04-05 10:37:48  Processing rows: 200000   Hashtable size: 199999   Memory usage: 75403880    rate: 0.054
2012-04-05 10:37:50  Processing rows: 300000   Hashtable size: 299999   Memory usage: 111404664   rate: 0.08
2012-04-05 10:37:54  Processing rows: 400000   Hashtable size: 399999   Memory usage: 151598960   rate: 0.108
2012-04-05 10:38:04  Processing rows: 500000   Hashtable size: 499999   Memory usage: 185483368   rate: 0.133
2012-04-05 10:38:09  Processing rows: 600000   Hashtable size: 599999   Memory usage: 221483392   rate: 0.158
2012-04-05 10:38:13  Processing rows: 700000   Hashtable size: 699999   Memory usage: 257482640   rate: 0.184
2012-04-05 10:38:19  Processing rows: 800000   Hashtable size: 799999   Memory usage: 297676944   rate: 0.213
2012-04-05 10:38:22  Processing rows: 900000   Hashtable size: 899999   Memory usage: 333676968   rate: 0.239
2012-04-05 10:38:27  Processing rows: 1000000  Hashtable size: 999999   Memory usage: 369676944   rate: 0.264
2012-04-05 10:38:31  Processing rows: 1100000  Hashtable size: 1099999  Memory usage: 405676968   rate: 0.29
2012-04-05 10:38:36  Processing rows: 1200000  Hashtable size: 1199999  Memory usage: 441676944   rate: 0.316
2012-04-05 10:38:42  Processing rows: 1300000  Hashtable size: 1299999  Memory usage: 477676944   rate: 0.342
2012-04-05 10:38:47  Processing rows: 1400000  Hashtable size: 1399999  Memory usage: 513676968   rate: 0.367
2012-04-05 10:38:52  Processing rows: 1500000  Hashtable size: 1499999  Memory usage: 549676944   rate: 0.393
2012-04-05 10:39:00  Processing rows: 1600000  Hashtable size: 1599999  Memory usage: 602454200   rate: 0.431
2012-04-05 10:39:08  Processing rows: 1700000  Hashtable size: 1699999  Memory usage: 630065552   rate: 0.451
2012-04-05 10:39:14  Processing rows: 1800000  Hashtable size: 1799999  Memory usage: 666065552   rate: 0.476
2012-04-05 10:39:20  Processing rows: 1900000  Hashtable size: 1899999  Memory usage: 702065552   rate: 0.502
2012-04-05 10:39:26  Processing rows: 2000000  Hashtable size: 1999999  Memory usage: 738065576   rate: 0.528
2012-04-05 10:39:36  Processing rows: 2100000  Hashtable size: 2099999  Memory usage: 774065552   rate: 0.554
2012-04-05 10:39:43  Processing rows: 2200000  Hashtable size: 2199999  Memory usage: 810065552   rate: 0.579
2012-04-05 10:39:51  Processing rows: 2300000  Hashtable size: 2299999  Memory usage: 846065576   rate: 0.605
2012-04-05 10:40:16  Processing rows: 2400000  Hashtable size: 2399999  Memory usage: 882085136   rate: 0.631
2012-04-05 10:40:24  Processing rows: 2500000  Hashtable size: 2499999  Memory usage: 918085208   rate: 0.657
2012-04-05 10:40:39  Processing rows: 2600000  Hashtable size: 2599999  Memory usage: 954065544   rate: 0.682
2012-04-05 10:40:48  Processing rows: 2700000  Hashtable size: 2699999  Memory usage: 990065568   rate: 0.708
2012-04-05 10:40:56  Processing rows: 2800000  Hashtable size: 2799999  Memory usage: 1026065552  rate: 0.734
2012-04-05 10:41:07  Processing rows: 2900000  Hashtable size: 2899999  Memory usage: 1062065576  rate: 0.76
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: Thursday, April 05, 2012 5:36 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

Can you try adding these settings?

set hive.enforce.bucketing=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

I have tried bucketing with 1000 buckets and with tables of more than 1TB of data, and they do go through fine.

On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <binhn...@viettel.com.vn> wrote:

Hi Bejoy,

Both my tables have 65m records (~1.8-1.9GB on hadoop) and are bucketed on the 'calling' column into 10 buckets.

As you said, hive will load only 1 bucket, ~180-190MB, into memory. That should hardly blow the heap (1.3GB).

According to the wiki, I set:

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

And ran the following SQL:

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling) where a.total_volume <> b.total_volume;

But it still created many hash tables and then threw a Java heap space error.

Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622

From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Thursday, April 05, 2012 3:07 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

Hi Amit

Sorry for the delayed response, had a terrible schedule. AFAIK, there are no flags that would let you take the hash table creation, compression and loading into tmp files away from the client node.

From my understanding, if you use a map-side join, the small table as a whole is converted into a hash table and compressed into a tmp file. Say your child jvm size is 1GB and this small table is 5GB: it would blow up your job if the map task tries to pull such a huge file into memory. A bucketed map join can help here: if the table is bucketed, say into 100 buckets, then each bucket may hold around 50MB of data, i.e. one tmp file would be just under 50MB. A mapper then needs to load only the required buckets into memory and thus hardly runs into memory issues.

Also, on the client the records are processed bucket by bucket and loaded into tmp files. So if your bucket size is larger than the heap size specified for your client, it will throw an out-of-memory error.

Regards
Bejoy KS
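For what it's worth, here is a minimal sketch of the kind of table layout Bejoy is describing, using the join key 'calling' from the queries in this thread; the table name, column types and bucket count are only illustrative, not the actual DDL used here:

    -- hypothetical bucketed copy of the small table (names and types are guesses)
    create table ra_ocs_syn_bucketed (calling string, total_volume bigint)
    clustered by (calling) into 10 buckets;

    -- hive.enforce.bucketing makes the insert produce exactly 10 bucket files
    set hive.enforce.bucketing = true;
    insert overwrite table ra_ocs_syn_bucketed
    select calling, total_volume from ra_ocs_syn;

With the data laid out like this, each bucket file is roughly one tenth of the table (assuming the keys distribute evenly), which is the per-bucket tmp file size Bejoy's arithmetic refers to.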
------------------------------

From: Amit Sharma <amitsharma1...@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
Sent: Tuesday, April 3, 2012 11:06 PM
Subject: Re: Why BucketJoinMap consume too much memory

I am experiencing similar behavior in my queries. All the conditions for a bucketed map join are met, and the only difference in execution when I set the hive.optimize.bucketmapjoin flag to true is that instead of a single hash table, multiple hash tables are created. All the hash tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using the distributed cache.

Can I find an example anywhere that shows the behavior of a bucketed map join where it does not create the hash tables on the client itself? If so, is there a flag for it?

Thanks,
Amit

On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:

Hi

On first look, it seems like a plain map join is happening in your case rather than a bucketed map join. The following conditions need to hold for a bucketed map join to work:
1) Both tables are bucketed on the join columns
2) The number of buckets in each table should be a multiple of the number in the other
3) Ensure that the tables have enough buckets

Note: If the data is large, say 1TB per table, and you have just a few buckets, say 100, each mapper may have to load 10GB or more. This would definitely blow up your jvm. The bottom line is to ensure your mappers are not heavily loaded by the bucketed data distribution.

Regards
Bejoy.K.S
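A generic way to check which of the two is actually happening is to look at the plan before running the query; this is a standard check, not something taken from the thread:

    hive> set hive.optimize.bucketmapjoin = true;
    hive> explain extended
        > select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
        > on (a.calling = b.calling) where a.total_volume <> b.total_volume;

If the bucketed variant was chosen, the map-join local work in the plan should be broken down per bucket file of the small table; if only a plain map join was produced, the local task will hash the whole table in one go, which is what the logs in this thread show.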
------------------------------

From: binhnt22 <binhn...@viettel.com.vn>
To: user@hive.apache.org
Sent: Saturday, March 31, 2012 6:46 AM
Subject: Why BucketJoinMap consume too much memory

I have 2 tables, each with 6 million records, clustered into 10 buckets.

These tables are very simple, with 1 key column and 1 value column; all I want is to get the keys that exist in both tables but with different values.

A normal join did the trick, taking only 141 secs:

select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling) where a.total_volume <> b.total_volume;

I tried to use a bucket map join by setting: set hive.optimize.bucketmapjoin = true;

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling) where a.total_volume <> b.total_volume;

2012-03-30 11:35:09  Starting to launch local task to process map join; maximum memory = 1398145024
2012-03-30 11:35:12  Processing rows: 200000   Hashtable size: 199999   Memory usage: 86646704    rate: 0.062
2012-03-30 11:35:15  Processing rows: 300000   Hashtable size: 299999   Memory usage: 128247464   rate: 0.092
2012-03-30 11:35:18  Processing rows: 400000   Hashtable size: 399999   Memory usage: 174041744   rate: 0.124
2012-03-30 11:35:21  Processing rows: 500000   Hashtable size: 499999   Memory usage: 214140840   rate: 0.153
2012-03-30 11:35:25  Processing rows: 600000   Hashtable size: 599999   Memory usage: 255181504   rate: 0.183
2012-03-30 11:35:29  Processing rows: 700000   Hashtable size: 699999   Memory usage: 296744320   rate: 0.212
2012-03-30 11:35:35  Processing rows: 800000   Hashtable size: 799999   Memory usage: 342538616   rate: 0.245
2012-03-30 11:35:38  Processing rows: 900000   Hashtable size: 899999   Memory usage: 384138552   rate: 0.275
2012-03-30 11:35:45  Processing rows: 1000000  Hashtable size: 999999   Memory usage: 425719576   rate: 0.304
2012-03-30 11:35:50  Processing rows: 1100000  Hashtable size: 1099999  Memory usage: 467319576   rate: 0.334
2012-03-30 11:35:56  Processing rows: 1200000  Hashtable size: 1199999  Memory usage: 508940504   rate: 0.364
2012-03-30 11:36:04  Processing rows: 1300000  Hashtable size: 1299999  Memory usage: 550521128   rate: 0.394
2012-03-30 11:36:09  Processing rows: 1400000  Hashtable size: 1399999  Memory usage: 592121128   rate: 0.424
2012-03-30 11:36:15  Processing rows: 1500000  Hashtable size: 1499999  Memory usage: 633720336   rate: 0.453
2012-03-30 11:36:22  Processing rows: 1600000  Hashtable size: 1599999  Memory usage: 692097568   rate: 0.495
2012-03-30 11:36:33  Processing rows: 1700000  Hashtable size: 1699999  Memory usage: 725308944   rate: 0.519
2012-03-30 11:36:40  Processing rows: 1800000  Hashtable size: 1799999  Memory usage: 766946424   rate: 0.549
2012-03-30 11:36:48  Processing rows: 1900000  Hashtable size: 1899999  Memory usage: 808527928   rate: 0.578
2012-03-30 11:36:55  Processing rows: 2000000  Hashtable size: 1999999  Memory usage: 850127928   rate: 0.608
2012-03-30 11:37:08  Processing rows: 2100000  Hashtable size: 2099999  Memory usage: 891708856   rate: 0.638
2012-03-30 11:37:16  Processing rows: 2200000  Hashtable size: 2199999  Memory usage: 933308856   rate: 0.668
2012-03-30 11:37:25  Processing rows: 2300000  Hashtable size: 2299999  Memory usage: 974908856   rate: 0.697
2012-03-30 11:37:34  Processing rows: 2400000  Hashtable size: 2399999  Memory usage: 1016529448  rate: 0.727
2012-03-30 11:37:43  Processing rows: 2500000  Hashtable size: 2499999  Memory usage: 1058129496  rate: 0.757
2012-03-30 11:37:58  Processing rows: 2600000  Hashtable size: 2599999  Memory usage: 1099708832  rate: 0.787
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

My system has 4 PCs, each with a CPU E2180, 2GB RAM and an 80GB HDD. One of them hosts the NameNode, JobTracker and Hive Server, and all of them run a DataNode and TaskTracker.

On every node I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB heap).

I want to ask you experts: why does the bucket map join consume so much memory? Am I doing something wrong, or is my configuration bad?

Best regards,

--
Nitin Pawar
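A side note on reading the local-task logs above (an observation from the numbers themselves, not documented output): the rate column looks like the reported memory usage divided by the maximum memory printed at the start of the local task. For the last "Processing rows" line above:

    1099708832 / 1398145024 ≈ 0.787

and 1398145024 bytes is roughly the ~1.3GB heap that HADOOP_HEAPSIZE=1500 leaves for the client, so the hash table for the "small" side had nearly filled the client heap by the time the OutOfMemoryError was thrown.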