Can you try adding these settings?

set hive.enforce.bucketing=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

I have tried bucketing with 1000 buckets on tables of more than 1 TB of data, and those jobs go through fine.
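A minimal sketch of what applying this suggestion could look like, combining the two suggested settings with the query from the original question further down the thread (illustrative only, not taken verbatim from any message):

    set hive.enforce.bucketing=true;
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

    -- the join from the original question, unchanged
    select /*+ MAPJOIN(a) */ *
    from ra_md_cdr_ggsn_synthetic a
    join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling)
    where a.total_volume <> b.total_volume;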
On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <binhn...@viettel.com.vn> wrote:

Hi Bejoy,

Both my tables have 65m records (~1.8-1.9 GB on Hadoop) and are bucketed on the 'calling' column into 10 buckets.

As you said, Hive will load only one bucket, about 180-190 MB, into memory. That should hardly blow the heap (1.3 GB).

According to the wiki, I set:

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

and ran the following SQL:

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling) where a.total_volume <> b.total_volume;

But it still created many hash tables and then threw a Java heap space error.

Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622

------------------------------

From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Thursday, April 05, 2012 3:07 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

Hi Amit

Sorry for the delayed response, I had a terrible schedule. AFAIK, there is no flag that would take the hash table creation, compression and loading into tmp files away from the client node.

From my understanding, if you use a map-side join, the small table as a whole is converted into a hash table and compressed into a tmp file. Say your child JVM size is 1 GB and the small table is 5 GB: the job would blow up if a map task tried to pull such a huge file into memory. A bucketed map join can help here; if the table is bucketed into, say, 100 buckets, then each bucket holds around 50 MB of data, i.e. each tmp file is just under 50 MB. A mapper then needs to load only the required buckets into memory and thus hardly runs into memory issues.

Also, on the client, the records are processed bucket by bucket and loaded into tmp files. So if a bucket is larger than the heap size specified for your client, it will throw an out-of-memory error.

Regards
Bejoy KS

------------------------------

From: Amit Sharma <amitsharma1...@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
Sent: Tuesday, April 3, 2012 11:06 PM
Subject: Re: Why BucketJoinMap consume too much memory

I am experiencing similar behavior in my queries. All the conditions for a bucketed map join are met, and the only difference in execution when I set the hive.optimize.bucketmapjoin flag to true is that multiple hash tables are created instead of a single one. All the hash tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using the distributed cache.

Is there an example anywhere that shows a bucketed map join that does not create the hash tables on the client itself? If so, is there a flag for it?

Thanks,
Amit
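As a side note (not part of the thread): one way to see which join strategy the planner actually chose for the query above is to inspect its plan; the exact operators listed vary by Hive version, so this is only a sketch using the query from the thread:

    explain extended
    select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
    on (a.calling = b.calling) where a.total_volume <> b.total_volume;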
On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:

Hi

On a first look, it seems like a plain map join is happening in your case rather than a bucketed map join. The following conditions need to hold for a bucketed map join to work:

1) Both tables are bucketed on the join columns.
2) The number of buckets in one table is a multiple of the number of buckets in the other.
3) The tables have enough buckets.

Note: if the data is large, say 1 TB per table, and you have just a few buckets, say 100, each mapper may have to load 10 GB. This would definitely blow your JVM. The bottom line is to ensure your mappers are not heavily loaded by the bucketed data distribution.

Regards
Bejoy.K.S

------------------------------

From: binhnt22 <binhn...@viettel.com.vn>
To: user@hive.apache.org
Sent: Saturday, March 31, 2012 6:46 AM
Subject: Why BucketJoinMap consume too much memory

I have 2 tables, each with 6 million records, clustered into 10 buckets.

These tables are very simple, with 1 key column and 1 value column; all I want is the keys that exist in both tables but with different values.

The normal join did the trick, taking only 141 seconds:

select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling) where a.total_volume <> b.total_volume;

I tried to use a bucket map join by setting: set hive.optimize.bucketmapjoin = true

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling) where a.total_volume <> b.total_volume;

2012-03-30 11:35:09  Starting to launch local task to process map join; maximum memory = 1398145024
2012-03-30 11:35:12  Processing rows: 200000   Hashtable size: 199999   Memory usage: 86646704    rate: 0.062
2012-03-30 11:35:15  Processing rows: 300000   Hashtable size: 299999   Memory usage: 128247464   rate: 0.092
2012-03-30 11:35:18  Processing rows: 400000   Hashtable size: 399999   Memory usage: 174041744   rate: 0.124
2012-03-30 11:35:21  Processing rows: 500000   Hashtable size: 499999   Memory usage: 214140840   rate: 0.153
2012-03-30 11:35:25  Processing rows: 600000   Hashtable size: 599999   Memory usage: 255181504   rate: 0.183
2012-03-30 11:35:29  Processing rows: 700000   Hashtable size: 699999   Memory usage: 296744320   rate: 0.212
2012-03-30 11:35:35  Processing rows: 800000   Hashtable size: 799999   Memory usage: 342538616   rate: 0.245
2012-03-30 11:35:38  Processing rows: 900000   Hashtable size: 899999   Memory usage: 384138552   rate: 0.275
2012-03-30 11:35:45  Processing rows: 1000000  Hashtable size: 999999   Memory usage: 425719576   rate: 0.304
2012-03-30 11:35:50  Processing rows: 1100000  Hashtable size: 1099999  Memory usage: 467319576   rate: 0.334
2012-03-30 11:35:56  Processing rows: 1200000  Hashtable size: 1199999  Memory usage: 508940504   rate: 0.364
2012-03-30 11:36:04  Processing rows: 1300000  Hashtable size: 1299999  Memory usage: 550521128   rate: 0.394
2012-03-30 11:36:09  Processing rows: 1400000  Hashtable size: 1399999  Memory usage: 592121128   rate: 0.424
2012-03-30 11:36:15  Processing rows: 1500000  Hashtable size: 1499999  Memory usage: 633720336   rate: 0.453
2012-03-30 11:36:22  Processing rows: 1600000  Hashtable size: 1599999  Memory usage: 692097568   rate: 0.495
2012-03-30 11:36:33  Processing rows: 1700000  Hashtable size: 1699999  Memory usage: 725308944   rate: 0.519
2012-03-30 11:36:40  Processing rows: 1800000  Hashtable size: 1799999  Memory usage: 766946424   rate: 0.549
2012-03-30 11:36:48  Processing rows: 1900000  Hashtable size: 1899999  Memory usage: 808527928   rate: 0.578
2012-03-30 11:36:55  Processing rows: 2000000  Hashtable size: 1999999  Memory usage: 850127928   rate: 0.608
2012-03-30 11:37:08  Processing rows: 2100000  Hashtable size: 2099999  Memory usage: 891708856   rate: 0.638
2012-03-30 11:37:16  Processing rows: 2200000  Hashtable size: 2199999  Memory usage: 933308856   rate: 0.668
2012-03-30 11:37:25  Processing rows: 2300000  Hashtable size: 2299999  Memory usage: 974908856   rate: 0.697
2012-03-30 11:37:34  Processing rows: 2400000  Hashtable size: 2399999  Memory usage: 1016529448  rate: 0.727
2012-03-30 11:37:43  Processing rows: 2500000  Hashtable size: 2499999  Memory usage: 1058129496  rate: 0.757
2012-03-30 11:37:58  Processing rows: 2600000  Hashtable size: 2599999  Memory usage: 1099708832  rate: 0.787
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

My system has 4 PCs, each with a CPU E2180, 2 GB RAM and an 80 GB HDD. One of them hosts the NameNode, JobTracker and Hive Server, and all of them run a DataNode and TaskTracker.

On all nodes I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3 GB heap).

I want to ask you experts: why does the bucket map join consume so much memory? Am I doing something wrong, or is my configuration bad?

Best regards,

--
Nitin Pawar
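To make the conditions Bejoy lists above concrete, here is a sketch (not from the thread; the bucket count and column types are illustrative assumptions) of DDL that would satisfy them for the two tables in the question:

    -- both tables bucketed on the join column 'calling', with compatible
    -- bucket counts; 100 buckets and the column types are assumptions
    create table ra_md_cdr_ggsn_synthetic (calling string, total_volume bigint)
    clustered by (calling) into 100 buckets;

    create table ra_ocs_cdr_ggsn_synthetic (calling string, total_volume bigint)
    clustered by (calling) into 100 buckets;

    -- ensure inserts honour the declared bucketing (setting mentioned earlier in the thread)
    set hive.enforce.bucketing=true;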