Hi Bejoy,

 

Both my tables have 65M records (~1.8-1.9GB on Hadoop) and are bucketed on the
'calling' column into 10 buckets.
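That works out to roughly:

  1.8-1.9GB / 10 buckets  ~  180-190MB per bucket
  65M rows  / 10 buckets  ~  6.5M rows per bucket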

 

As you said, Hive should load only one bucket (~180-190MB) into memory. That is
hardly enough to blow the heap (1.3GB).

 

Following the wiki, I set:

 

  set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
  set hive.optimize.bucketmapjoin = true;
  set hive.optimize.bucketmapjoin.sortedmerge = true;

 

and ran the following SQL:

 

select /*+ MAPJOIN(a) */ *
from ra_md_cdr_ggsn_synthetic a
join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling)
where a.total_volume <> b.total_volume;

 

But it still created many hash tables and then threw a Java heap space error.
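
Should I check the plan with EXPLAIN EXTENDED to see whether the bucket
optimization was actually applied? For example (same query as above):

  explain extended
  select /*+ MAPJOIN(a) */ *
  from ra_md_cdr_ggsn_synthetic a
  join ra_ocs_cdr_ggsn_synthetic b
  on (a.calling = b.calling)
  where a.total_volume <> b.total_volume;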

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy...@yahoo.com] 
Sent: Thursday, April 05, 2012 3:07 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Amit

 

      Sorry for the delayed response, I had a terrible schedule. AFAIK, there
are no flags that would move the hash table creation, compression, and loading
into tmp files away from the client node.

      From my understanding, if you use a map-side join, the small table as a
whole is converted into a hash table and compressed into a tmp file. Say your
child JVM size is 1GB and this small table is 5GB; it would blow up your job
if a map task tried to pull such a huge file into memory. A bucketed map join
can help here: if the table is bucketed, say into 100 buckets, then each
bucket may hold around 50MB of data, i.e. each tmp file would be just under
50MB. A mapper then needs to load only the required buckets into memory and
thus hardly runs into memory issues.

    Also, on the client, the records are processed bucket by bucket and
loaded into tmp files. So if your bucket size is larger than the heap size
specified for your client, it will throw an out-of-memory error.
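
As a back-of-the-envelope rule (my own rule of thumb, not an official
formula), you can size the bucket count from the heap you have:

  buckets >= small_table_size / (heap_size * safety_factor)

  e.g. a 5GB small table with a 1GB JVM and a 0.5 safety factor:
       5120MB / (1024MB * 0.5) = 10  =>  use at least 10 buckets,
       more if the data is skewed on the join key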

 

Regards

Bejoy KS

 

  _____  

From: Amit Sharma <amitsharma1...@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com> 
Sent: Tuesday, April 3, 2012 11:06 PM
Subject: Re: Why BucketJoinMap consume too much memory





I am experiencing similar behavior in my queries. All the conditions for a
bucketed map join are met, and the only difference in execution when I set
the hive.optimize.bucketmapjoin flag to true is that, instead of a single
hash table, multiple hash tables are created. All the hash tables are still
created on the client side and loaded into tmp files, which are then
distributed to the mappers using the distributed cache.

Can I find an example anywhere that shows the behavior of a bucketed map
join where it does not create the hash tables on the client itself? If so,
is there a flag for it?

Thanks,
Amit

On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:

Hi
    At first look, it seems like a plain map join is happening in your case
rather than a bucketed map join. The following conditions need to hold for a
bucketed map join to work (a small sketch follows the note below):
1) Both tables are bucketed on the join columns
2) The number of buckets in one table is a multiple of the number of buckets
in the other
3) The tables have enough buckets

Note: if the data is large, say 1TB per table, and you have just a few
buckets, say 100, each mapper may have to load over 10GB. This would
definitely blow your JVM. The bottom line is to ensure your mappers are not
overloaded by the bucketed data distribution.
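
For example, a pair of tables like the ones below would satisfy 1) and 2).
The names and types here are made up just to illustrate; also remember to set
hive.enforce.bucketing before populating the tables, so each bucket is
physically written as its own file:

  set hive.enforce.bucketing = true;

  create table big_tab (calling string, total_volume bigint)
  clustered by (calling) into 32 buckets;

  create table small_tab (calling string, total_volume bigint)
  clustered by (calling) into 16 buckets;  -- 32 is a multiple of 16

  insert overwrite table small_tab
  select calling, total_volume from small_tab_staging;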

Regards
Bejoy.K.S

  _____  

From: binhnt22 <binhn...@viettel.com.vn>
To: user@hive.apache.org 
Sent: Saturday, March 31, 2012 6:46 AM
Subject: Why BucketJoinMap consume too much memory

 

I have 2 tables; each has 6 million records and is clustered into 10 buckets.

 

These tables are very simple, with 1 key column and 1 value column. All I
want is to get the keys that exist in both tables but with different values.

 

A normal join did the trick, taking only 141 seconds:

 

select *
from ra_md_cdr_ggsn_synthetic a
join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling)
where a.total_volume <> b.total_volume;

 

I tried to use a bucket map join by setting set hive.optimize.bucketmapjoin
= true, and ran:

 

select /*+ MAPJOIN(a) */ *
from ra_md_cdr_ggsn_synthetic a
join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling)
where a.total_volume <> b.total_volume;

 

2012-03-30 11:35:09     Starting to launch local task to process map join; maximum memory = 1398145024
2012-03-30 11:35:12     Processing rows: 200000    Hashtable size: 199999    Memory usage: 86646704     rate: 0.062
2012-03-30 11:35:15     Processing rows: 300000    Hashtable size: 299999    Memory usage: 128247464    rate: 0.092
2012-03-30 11:35:18     Processing rows: 400000    Hashtable size: 399999    Memory usage: 174041744    rate: 0.124
2012-03-30 11:35:21     Processing rows: 500000    Hashtable size: 499999    Memory usage: 214140840    rate: 0.153
2012-03-30 11:35:25     Processing rows: 600000    Hashtable size: 599999    Memory usage: 255181504    rate: 0.183
2012-03-30 11:35:29     Processing rows: 700000    Hashtable size: 699999    Memory usage: 296744320    rate: 0.212
2012-03-30 11:35:35     Processing rows: 800000    Hashtable size: 799999    Memory usage: 342538616    rate: 0.245
2012-03-30 11:35:38     Processing rows: 900000    Hashtable size: 899999    Memory usage: 384138552    rate: 0.275
2012-03-30 11:35:45     Processing rows: 1000000   Hashtable size: 999999    Memory usage: 425719576    rate: 0.304
2012-03-30 11:35:50     Processing rows: 1100000   Hashtable size: 1099999   Memory usage: 467319576    rate: 0.334
2012-03-30 11:35:56     Processing rows: 1200000   Hashtable size: 1199999   Memory usage: 508940504    rate: 0.364
2012-03-30 11:36:04     Processing rows: 1300000   Hashtable size: 1299999   Memory usage: 550521128    rate: 0.394
2012-03-30 11:36:09     Processing rows: 1400000   Hashtable size: 1399999   Memory usage: 592121128    rate: 0.424
2012-03-30 11:36:15     Processing rows: 1500000   Hashtable size: 1499999   Memory usage: 633720336    rate: 0.453
2012-03-30 11:36:22     Processing rows: 1600000   Hashtable size: 1599999   Memory usage: 692097568    rate: 0.495
2012-03-30 11:36:33     Processing rows: 1700000   Hashtable size: 1699999   Memory usage: 725308944    rate: 0.519
2012-03-30 11:36:40     Processing rows: 1800000   Hashtable size: 1799999   Memory usage: 766946424    rate: 0.549
2012-03-30 11:36:48     Processing rows: 1900000   Hashtable size: 1899999   Memory usage: 808527928    rate: 0.578
2012-03-30 11:36:55     Processing rows: 2000000   Hashtable size: 1999999   Memory usage: 850127928    rate: 0.608
2012-03-30 11:37:08     Processing rows: 2100000   Hashtable size: 2099999   Memory usage: 891708856    rate: 0.638
2012-03-30 11:37:16     Processing rows: 2200000   Hashtable size: 2199999   Memory usage: 933308856    rate: 0.668
2012-03-30 11:37:25     Processing rows: 2300000   Hashtable size: 2299999   Memory usage: 974908856    rate: 0.697
2012-03-30 11:37:34     Processing rows: 2400000   Hashtable size: 2399999   Memory usage: 1016529448   rate: 0.727
2012-03-30 11:37:43     Processing rows: 2500000   Hashtable size: 2499999   Memory usage: 1058129496   rate: 0.757
2012-03-30 11:37:58     Processing rows: 2600000   Hashtable size: 2599999   Memory usage: 1099708832   rate: 0.787

Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

 

My system has 4 PCs, each with an E2180 CPU, 2GB RAM, and an 80GB HDD. One of
them hosts the NameNode, JobTracker, and Hive Server, and all of them run a
DataNode and TaskTracker.

 

On every node, I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB
heap).

 

I want to ask you experts: why does the bucket map join consume so much
memory? Am I doing something wrong, or is my configuration bad?

 

Best regards,

 

 

 

 
