Hi Amit

      Sorry for the delayed response, I had a terrible schedule. AFAIK, there are 
no flags that would help you move the hash table creation, compression and 
loading into tmp files away from the client node.
      From my understanding, if you use a map-side join, the small table as a 
whole is converted into a hash table and compressed into a tmp file. Say your 
child JVM size is 1GB and this small table is 5GB; it'd blow up your job if 
the map tasks tried to pull such a huge file into memory. A bucketed map join can 
help here: if the table is bucketed into, say, 100 buckets, then each bucket holds 
around 50MB of data, i.e. one tmp file would be just under 50MB. Each mapper 
then needs to load only the required buckets into memory and thus hardly runs 
into memory issues.
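      Just to illustrate, here is a rough sketch of the DDL and settings that should 
trigger a bucketed map join (the table and column names are only made up for 
illustration, not from your schema):

set hive.enforce.bucketing = true;

create table small_tbl (id string, val bigint)
clustered by (id) into 100 buckets;
create table big_tbl (id string, val bigint)
clustered by (id) into 100 buckets;

-- populate both tables with insert ... select so the rows actually land in their buckets

set hive.optimize.bucketmapjoin = true;
select /*+ MAPJOIN(small_tbl) */ * from small_tbl
join big_tbl on (small_tbl.id = big_tbl.id);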
    Also, on the client, the records are processed bucket by bucket and loaded 
into tmp files. So if your bucket size is larger than the heap size 
specified for your client, it'd throw an out-of-memory error.
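      If I remember right, the "rate" the local task prints is the hash table's memory 
usage as a fraction of the client heap, and the task is aborted once it crosses 
hive.mapjoin.localtask.max.memory.usage (0.90 by default). You can tune that, say

set hive.mapjoin.localtask.max.memory.usage = 0.90;

but the better fix is usually a larger client heap or more (smaller) buckets.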

Regards
Bejoy KS


________________________________
 From: Amit Sharma <amitsharma1...@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com> 
Sent: Tuesday, April 3, 2012 11:06 PM
Subject: Re: Why BucketJoinMap consume too much memory
 

I am experiencing similar behavior in my queries. All the conditions for a 
bucketed map join are met, and the only difference in execution when I set the 
hive.optimize.bucketmapjoin flag to true is that instead of a single hash 
table, multiple hash tables are created. All the hash tables are still created 
on the client side and loaded into tmp files, which are then distributed to the 
mappers using the distributed cache.

Can I find an example anywhere that shows the behavior of a bucketed map join 
wherein it does not create the hash tables on the client itself? If so, is 
there a flag for it?

Thanks,
Amit


On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:

Hi
>    On a first look, it seems like a plain map join is happening in your case rather 
>than a bucketed map join. The following conditions need to hold for a bucketed map 
>join to work:
>1) Both tables are bucketed on the join columns
>2) The number of buckets in one table is a multiple of the number of buckets in the other
>3) The tables have enough buckets
>
>Note: If the data is large, say 1TB per table, and you have just a few 
>buckets, say 100, then each mapper may have to load 10GB. That would 
>definitely blow up your JVM. Bottom line: ensure your mappers are not heavily 
>loaded by the bucketed data distribution.
>
>Regards
>Bejoy.K.S
>
>
>________________________________
> From: binhnt22 <binhn...@viettel.com.vn>
>To: user@hive.apache.org 
>Sent: Saturday, March 31, 2012 6:46 AM
>Subject: Why BucketJoinMap consume too much memory
> 
>
>
>I have 2 tables, each with 6 million records, clustered into 10 buckets.
> 
>These tables are very simple, with 1 key column and 1 value column; all I want is 
>to get the keys that exist in both tables but with different values.
> 
>The normal join did the trick, taking only 141 secs:
> 
>select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b 
>on (a.calling = b.calling) where a.total_volume <> b.total_volume;
> 
>I tried to use the bucket map join by setting:   set hive.optimize.bucketmapjoin = true
> 
>select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b 
>on (a.calling = b.calling) where a.total_volume <> b.total_volume;
> 
>2012-03-30 11:35:09    Starting to launch local task to process map join;    maximum memory = 1398145024
>2012-03-30 11:35:12    Processing rows: 200000    Hashtable size: 199999    Memory usage: 86646704    rate: 0.062
>2012-03-30 11:35:15    Processing rows: 300000    Hashtable size: 299999    Memory usage: 128247464    rate: 0.092
>2012-03-30 11:35:18    Processing rows: 400000    Hashtable size: 399999    Memory usage: 174041744    rate: 0.124
>2012-03-30 11:35:21    Processing rows: 500000    Hashtable size: 499999    Memory usage: 214140840    rate: 0.153
>2012-03-30 11:35:25    Processing rows: 600000    Hashtable size: 599999    Memory usage: 255181504    rate: 0.183
>2012-03-30 11:35:29    Processing rows: 700000    Hashtable size: 699999    Memory usage: 296744320    rate: 0.212
>2012-03-30 11:35:35    Processing rows: 800000    Hashtable size: 799999    Memory usage: 342538616    rate: 0.245
>2012-03-30 11:35:38    Processing rows: 900000    Hashtable size: 899999    Memory usage: 384138552    rate: 0.275
>2012-03-30 11:35:45    Processing rows: 1000000    Hashtable size: 999999    Memory usage: 425719576    rate: 0.304
>2012-03-30 11:35:50    Processing rows: 1100000    Hashtable size: 1099999    Memory usage: 467319576    rate: 0.334
>2012-03-30 11:35:56    Processing rows: 1200000    Hashtable size: 1199999    Memory usage: 508940504    rate: 0.364
>2012-03-30 11:36:04    Processing rows: 1300000    Hashtable size: 1299999    Memory usage: 550521128    rate: 0.394
>2012-03-30 11:36:09    Processing rows: 1400000    Hashtable size: 1399999    Memory usage: 592121128    rate: 0.424
>2012-03-30 11:36:15    Processing rows: 1500000    Hashtable size: 1499999    Memory usage: 633720336    rate: 0.453
>2012-03-30 11:36:22    Processing rows: 1600000    Hashtable size: 1599999    Memory usage: 692097568    rate: 0.495
>2012-03-30 11:36:33    Processing rows: 1700000    Hashtable size: 1699999    Memory usage: 725308944    rate: 0.519
>2012-03-30 11:36:40    Processing rows: 1800000    Hashtable size: 1799999    Memory usage: 766946424    rate: 0.549
>2012-03-30 11:36:48    Processing rows: 1900000    Hashtable size: 1899999    Memory usage: 808527928    rate: 0.578
>2012-03-30 11:36:55    Processing rows: 2000000    Hashtable size: 1999999    Memory usage: 850127928    rate: 0.608
>2012-03-30 11:37:08    Processing rows: 2100000    Hashtable size: 2099999    Memory usage: 891708856    rate: 0.638
>2012-03-30 11:37:16    Processing rows: 2200000    Hashtable size: 2199999    Memory usage: 933308856    rate: 0.668
>2012-03-30 11:37:25    Processing rows: 2300000    Hashtable size: 2299999    Memory usage: 974908856    rate: 0.697
>2012-03-30 11:37:34    Processing rows: 2400000    Hashtable size: 2399999    Memory usage: 1016529448    rate: 0.727
>2012-03-30 11:37:43    Processing rows: 2500000    Hashtable size: 2499999    Memory usage: 1058129496    rate: 0.757
>2012-03-30 11:37:58    Processing rows: 2600000    Hashtable size: 2599999    Memory usage: 1099708832    rate: 0.787
>Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
> 
>My system has 4 PCs, each with a CPU E2180, 2GB RAM and an 80GB HDD. One of them hosts the 
>NameNode, JobTracker and Hive Server, and all of them run a DataNode and TaskTracker.
> 
>On all nodes I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB heap).
> 
>I want to ask you experts: why does the bucket map join consume so much memory? Am I doing 
>something wrong, or is my configuration bad?
> 
>Best regards,
> 
>
>
