Hello,

First of all, thank you for your responses and assistance. I'm currently encountering a scenario where Hive is not choosing a Bucket Map Join, and I wonder whether this is because my underlying execution engine is MapReduce.
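For reference, this is roughly how I inspect the relevant settings in my beeline session (a sketch, not verified output; I expect the engine to be "mr" and the two bucketmapjoin flags to be true in my environment):

```
-- In beeline / the Hive CLI, "set <property>;" with no value prints the current setting.
set hive.execution.engine;
set hive.optimize.bucketmapjoin;
set hive.enforce.bucketmapjoin;
set hive.auto.convert.join;
set hive.convert.join.bucket.mapjoin.tez;
```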
In addition, I am operating in a test environment that uses three VMware virtual machines. The memory allocations for these machines are 4GB, 2GB, and 2GB, respectively. Of these, the 4GB virtual machine has been assigned two CPU cores, while the other two have a single CPU core each. I am unsure whether these factors could influence Hive's decision not to select a Bucket Map Join. Any guidance or insight on this matter would be greatly appreciated.

Thanks

Okumin <m...@okumin.com> wrote on Mon, Jun 26, 2023 at 02:23:

> Hi smart li,
>
> As far as I tried with Hive 3.1.2 on Tez, Bucket Map Join was probably
> triggered. My configurations could be different from yours, though.
>
> # How I tested
>
> ## hive-site.xml
>
> https://github.com/zookage/zookage/blob/v0.2.3/kubernetes/base/common/config/hive/hive-site.xml
>
> ## Prepare test data
>
> ```
> zookage@client-node-0:~$ cat generate.sh
> #!/bin/bash
>
> for i in {1..8000000}; do
>   echo $i
> done
> zookage@client-node-0:~$ bash generate.sh > /tmp/ids.csv
> zookage@client-node-0:~$ hdfs dfs -copyFromLocal /tmp/ids.csv /tmp/ids.csv
> zookage@client-node-0:~$ hdfs dfs -tail hdfs:///tmp/ids.csv | tail -n 3
> 7999998
> 7999999
> 8000000
> zookage@client-node-0:~$
> ```
>
> ## Load test data
>
> ```
> zookage@client-node-0:~$ beeline -e "
> > create table map_join_tb(id int)
> > clustered by (id) into 32 buckets;
> > load data inpath '/tmp/ids.csv' into table map_join_tb;
> > "
> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
> Connected to: Apache Hive (version 3.1.2)
> ...
> zookage@client-node-0:~$ beeline -e "select min(id), max(id), count(*) from map_join_tb"
> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
> ...
> +------+----------+----------+
> | _c0  |   _c1    |   _c2    |
> +------+----------+----------+
> | 1    | 8000000  | 8000000  |
> +------+----------+----------+
> zookage@client-node-0:~$ hdfs dfs -ls /user/hive/warehouse/map_join_tb
> Found 32 items
> -rw-r--r--   3 zookage hive   1965281 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000000_0
> -rw-r--r--   3 zookage hive   1965275 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000001_0
> -rw-r--r--   3 zookage hive   1965275 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000002_0
> ...
> -rw-r--r--   3 zookage hive   1965279 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000030_0
> -rw-r--r--   3 zookage hive   1965279 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000031_0
> zookage@client-node-0:~$
> ```
>
> # Submit a join query
>
> Bucket Map Join is chosen. If I remember correctly, we don't need
> `hive.optimize.bucketmapjoin` or `hive.enforce.bucketmapjoin` for Tez,
> because `hive.convert.join.bucket.mapjoin.tez` is for Tez and it is true by
> default.
>
> ```
> zookage@client-node-0:~$ beeline -e "explain select * from map_join_tb a join map_join_tb b on a.id=b.id;"
> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
> ...
> | Stage-1                                                      |
> |   Map 1 vectorized                                           |
> |     File Output Operator [FS_23]                             |
> |       Map Join Operator [MAPJOIN_22] (rows=8800000 width=6)  |
> |         BucketMapJoin:true,Conds:SEL_21._col0=RS_19._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1"] |
> ...
> ```
>
> I guess some preconditions are different. You may also check the logs around
> ConvertJoinMapJoin on HiveServer2, but they are sometimes misleading in my
> experience.
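(As an aside on the Tez-specific flag mentioned above, a minimal sketch of inspecting and toggling it per session, assuming a beeline connection to the same Hive 3.1.2 on Tez as in the reply:)

```
-- Print the current value of the Tez-specific conversion flag.
set hive.convert.join.bucket.mapjoin.tez;

-- Turn it off for this session only; further down in the thread the same
-- thing is done with --hiveconf on the beeline command line.
set hive.convert.join.bucket.mapjoin.tez=false;

explain select * from map_join_tb a join map_join_tb b on a.id = b.id;
```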
>
> - Did you apply some patches to your Hive?
> - Do you use different configurations?
> - Could the stats be invalid?
>
> By the way, on my machine, Hive 3.1.2 with Bucket Map Join generated
> incorrect rows. I have not taken a look at the reason (and it might be
> possible that the root cause exists in my environment).
>
> ```
> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*) from map_join_tb a join map_join_tb b on a.id=b.id;"
> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
> ...
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 3 ..........      container     SUCCEEDED      3          3        0        0       0       0
> Map 1 ..........      container     SUCCEEDED     32         32        0        0       0       0
> Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 17.36 s
> ----------------------------------------------------------------------------------------------
> ...
> +------+----------+---------+
> | _c0  |   _c1    |   _c2   |
> +------+----------+---------+
> | 46   | 7999987  | 249776  |
> +------+----------+---------+
> ...
> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*) from map_join_tb a join map_join_tb b on a.id=b.id;" --hiveconf hive.convert.join.bucket.mapjoin.tez=false
> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
> ...
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 ..........      container     SUCCEEDED      3          3        0        0       0       0
> Map 4 ..........      container     SUCCEEDED      3          3        0        0       0       0
> Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
> Reducer 3 ......      container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 29.79 s
> ----------------------------------------------------------------------------------------------
> ...
> +------+----------+----------+
> | _c0  |   _c1    |   _c2    |
> +------+----------+----------+
> | 1    | 8000000  | 8000000  |
> +------+----------+----------+
> ```
>
> To be honest, I am not so familiar with this feature. Another expert might
> give you better insights.
>
> Thanks.
>
> On Sun, Jun 25, 2023 at 8:43 PM smart li <smartli666...@gmail.com> wrote:
>
>> Hello Hive Users,
>>
>> I'm currently trying to understand how Bucket Map Join works in Hive, but
>> I'm encountering some issues that I need help with. Here's what I did:
>>
>> Firstly, I created a Hive table using the following statement:
>>
>> create table map_join_tb(
>>   id int
>> )
>> clustered by (id) into 32 buckets;
>>
>> Then, I inserted 8 million rows of data into the table, with the 'id'
>> field ranging from 1 to 8 million. After the data was bucketed, each bucket
>> was approximately 2MB in size.
>>
>> I then set the following bucket map join configurations:
>>
>> set hive.optimize.bucketmapjoin=true;
>> set hive.enforce.bucketmapjoin=true;
>>
>> Lastly, I ran an EXPLAIN on the following SQL:
>>
>> explain select * from map_join_tb a join map_join_tb b on a.id=b.id;
>>
>> Surprisingly, it seems that a Reduce Join, not a Bucket Map Join, was
>> being performed according to the plan. I'm not sure why this is happening.
>> Under what conditions does Hive decide to perform a Bucket Map Join?
>>
>> For reference, I am using Hive version 3.1.2. Any help or insights into
>> this would be greatly appreciated. Thank you very much in advance.
>>
>> Best Regards,
>> [smartli]
>>
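A minimal sketch of one way to probe the question above on the MapReduce engine. It rests on the assumption (mine, not something confirmed in the thread) that on MapReduce the join must first qualify as a map join, for example via a MAPJOIN hint, before hive.optimize.bucketmapjoin can rewrite it into a bucket map join:

```
-- Hypothetical beeline session against Hive 3.1.2 on the MapReduce engine.
set hive.execution.engine=mr;
set hive.optimize.bucketmapjoin=true;
set hive.enforce.bucketmapjoin=true;

-- MAPJOIN hints are ignored by default in recent Hive versions; allow them so
-- the small side is built as a hash table, which the bucket optimization refines.
set hive.ignore.mapjoin.hint=false;

explain
select /*+ MAPJOIN(b) */ *
from map_join_tb a
join map_join_tb b on a.id = b.id;
```

If the resulting plan shows a Map Join Operator but still no BucketMapJoin, the preconditions mentioned in the replies above (stats, configurations, patches) seem the more likely cause.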