Can you try adding these settings?

set hive.enforce.bucketing=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

I have tried bucketing with 1000 buckets on tables of more than 1 TB of data, and those jobs go through fine.
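A minimal sketch of what applying this suggestion could look like, combining the two suggested settings with the query from the original question further down the thread (illustrative only, not taken verbatim from any message):

    set hive.enforce.bucketing=true;
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

    -- the join from the original question, unchanged
    select /*+ MAPJOIN(a) */ *
    from ra_md_cdr_ggsn_synthetic a
    join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling)
    where a.total_volume <> b.total_volume;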
On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <binhn...@viettel.com.vn> wrote:

Hi Bejoy,

Both my tables have 65m records (~1.8-1.9 GB on Hadoop) and are bucketed on the 'calling' column into 10 buckets.

As you said, Hive will load only one bucket, about 180-190 MB, into memory. That should hardly blow the heap (1.3 GB).

According to the wiki, I set:

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

and ran the following SQL:

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling) where a.total_volume <> b.total_volume;

But it still created many hash tables and then threw a Java heap space error.

Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622

------------------------------

From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Thursday, April 05, 2012 3:07 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

Hi Amit

Sorry for the delayed response, I had a terrible schedule. AFAIK, there is no flag that would take the hash table creation, compression and loading into tmp files away from the client node.

From my understanding, if you use a map-side join, the small table as a whole is converted into a hash table and compressed into a tmp file. Say your child JVM size is 1 GB and the small table is 5 GB: the job would blow up if a map task tried to pull such a huge file into memory. A bucketed map join can help here; if the table is bucketed into, say, 100 buckets, then each bucket holds around 50 MB of data, i.e. each tmp file is just under 50 MB. A mapper then needs to load only the required buckets into memory and thus hardly runs into memory issues.

Also, on the client, the records are processed bucket by bucket and loaded into tmp files. So if a bucket is larger than the heap size specified for your client, it will throw an out-of-memory error.

Regards
Bejoy KS

------------------------------

From: Amit Sharma <amitsharma1...@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
Sent: Tuesday, April 3, 2012 11:06 PM
Subject: Re: Why BucketJoinMap consume too much memory

I am experiencing similar behavior in my queries. All the conditions for a bucketed map join are met, and the only difference in execution when I set the hive.optimize.bucketmapjoin flag to true is that multiple hash tables are created instead of a single one. All the hash tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using the distributed cache.

Is there an example anywhere that shows a bucketed map join that does not create the hash tables on the client itself? If so, is there a flag for it?

Thanks,
Amit
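As a side note (not part of the thread): one way to see which join strategy the planner actually chose for the query above is to inspect its plan; the exact operators listed vary by Hive version, so this is only a sketch using the query from the thread:

    explain extended
    select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
    on (a.calling = b.calling) where a.total_volume <> b.total_volume;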
On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:

Hi

On a first look, it seems like a plain map join is happening in your case rather than a bucketed map join. The following conditions need to hold for a bucketed map join to work:

1) Both tables are bucketed on the join columns.
2) The number of buckets in one table is a multiple of the number of buckets in the other.
3) The tables have enough buckets.

Note: if the data is large, say 1 TB per table, and you have just a few buckets, say 100, each mapper may have to load 10 GB. This would definitely blow your JVM. The bottom line is to ensure your mappers are not heavily loaded by the bucketed data distribution.

Regards
Bejoy.K.S

------------------------------

From: binhnt22 <binhn...@viettel.com.vn>
To: user@hive.apache.org
Sent: Saturday, March 31, 2012 6:46 AM
Subject: Why BucketJoinMap consume too much memory

I have 2 tables, each with 6 million records, clustered into 10 buckets.

These tables are very simple, with 1 key column and 1 value column; all I want is the keys that exist in both tables but with different values.

The normal join did the trick, taking only 141 seconds:

select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling) where a.total_volume <> b.total_volume;

I tried to use a bucket map join by setting: set hive.optimize.bucketmapjoin = true

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling) where a.total_volume <> b.total_volume;

2012-03-30 11:35:09  Starting to launch local task to process map join; maximum memory = 1398145024
2012-03-30 11:35:12  Processing rows: 200000   Hashtable size: 199999   Memory usage: 86646704    rate: 0.062
2012-03-30 11:35:15  Processing rows: 300000   Hashtable size: 299999   Memory usage: 128247464   rate: 0.092
2012-03-30 11:35:18  Processing rows: 400000   Hashtable size: 399999   Memory usage: 174041744   rate: 0.124
2012-03-30 11:35:21  Processing rows: 500000   Hashtable size: 499999   Memory usage: 214140840   rate: 0.153
2012-03-30 11:35:25  Processing rows: 600000   Hashtable size: 599999   Memory usage: 255181504   rate: 0.183
2012-03-30 11:35:29  Processing rows: 700000   Hashtable size: 699999   Memory usage: 296744320   rate: 0.212
2012-03-30 11:35:35  Processing rows: 800000   Hashtable size: 799999   Memory usage: 342538616   rate: 0.245
2012-03-30 11:35:38  Processing rows: 900000   Hashtable size: 899999   Memory usage: 384138552   rate: 0.275
2012-03-30 11:35:45  Processing rows: 1000000  Hashtable size: 999999   Memory usage: 425719576   rate: 0.304
2012-03-30 11:35:50  Processing rows: 1100000  Hashtable size: 1099999  Memory usage: 467319576   rate: 0.334
2012-03-30 11:35:56  Processing rows: 1200000  Hashtable size: 1199999  Memory usage: 508940504   rate: 0.364
2012-03-30 11:36:04  Processing rows: 1300000  Hashtable size: 1299999  Memory usage: 550521128   rate: 0.394
2012-03-30 11:36:09  Processing rows: 1400000  Hashtable size: 1399999  Memory usage: 592121128   rate: 0.424
2012-03-30 11:36:15  Processing rows: 1500000  Hashtable size: 1499999  Memory usage: 633720336   rate: 0.453
2012-03-30 11:36:22  Processing rows: 1600000  Hashtable size: 1599999  Memory usage: 692097568   rate: 0.495
2012-03-30 11:36:33  Processing rows: 1700000  Hashtable size: 1699999  Memory usage: 725308944   rate: 0.519
2012-03-30 11:36:40  Processing rows: 1800000  Hashtable size: 1799999  Memory usage: 766946424   rate: 0.549
2012-03-30 11:36:48  Processing rows: 1900000  Hashtable size: 1899999  Memory usage: 808527928   rate: 0.578
2012-03-30 11:36:55  Processing rows: 2000000  Hashtable size: 1999999  Memory usage: 850127928   rate: 0.608
2012-03-30 11:37:08  Processing rows: 2100000  Hashtable size: 2099999  Memory usage: 891708856   rate: 0.638
2012-03-30 11:37:16  Processing rows: 2200000  Hashtable size: 2199999  Memory usage: 933308856   rate: 0.668
2012-03-30 11:37:25  Processing rows: 2300000  Hashtable size: 2299999  Memory usage: 974908856   rate: 0.697
2012-03-30 11:37:34  Processing rows: 2400000  Hashtable size: 2399999  Memory usage: 1016529448  rate: 0.727
2012-03-30 11:37:43  Processing rows: 2500000  Hashtable size: 2499999  Memory usage: 1058129496  rate: 0.757
2012-03-30 11:37:58  Processing rows: 2600000  Hashtable size: 2599999  Memory usage: 1099708832  rate: 0.787
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

My system has 4 PCs, each with a CPU E2180, 2 GB RAM and an 80 GB HDD. One of them hosts the NameNode, JobTracker and Hive Server, and all of them run a DataNode and TaskTracker.

On all nodes I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3 GB heap).

I want to ask you experts: why does the bucket map join consume so much memory? Am I doing something wrong, or is my configuration bad?

Best regards,

--
Nitin Pawar
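To make the conditions Bejoy lists above concrete, here is a sketch (not from the thread; the bucket count and column types are illustrative assumptions) of DDL that would satisfy them for the two tables in the question:

    -- both tables bucketed on the join column 'calling', with compatible
    -- bucket counts; 100 buckets and the column types are assumptions
    create table ra_md_cdr_ggsn_synthetic (calling string, total_volume bigint)
    clustered by (calling) into 100 buckets;

    create table ra_ocs_cdr_ggsn_synthetic (calling string, total_volume bigint)
    clustered by (calling) into 100 buckets;

    -- ensure inserts honour the declared bucketing (setting mentioned earlier in the thread)
    set hive.enforce.bucketing=true;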