Hi smart li, As far as I tried with Hive 3.1.2 on Tez, Bucket Map Join was probably triggered. My configurations could be different from yours, though.
# How I tested ## hive-site.xml https://github.com/zookage/zookage/blob/v0.2.3/kubernetes/base/common/config/hive/hive-site.xml ## Prepare test data ``` zookage@client-node-0:~$ cat generate.sh #!/bin/bash for i in {1..8000000}; do echo $i done zookage@client-node-0:~$ bash generate.sh > /tmp/ids.csv zookage@client-node-0:~$ hdfs dfs -copyFromLocal /tmp/ids.csv /tmp/ids.csv zookage@client-node-0:~$ hdfs dfs -tail hdfs:///tmp/ids.csv | tail -n 3 7999998 7999999 8000000 zookage@client-node-0:~$ ``` ## Load test data ``` zookage@client-node-0:~$ beeline -e " > create table map_join_tb(id int) > clustered by (id) into 32 buckets; > load data inpath '/tmp/ids.csv' into table map_join_tb; > " Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage Connected to: Apache Hive (version 3.1.2) ... zookage@client-node-0:~$ beeline -e "select min(id), max(id), count(*) from map_join_tb" Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage ... +------+----------+----------+ | _c0 | _c1 | _c2 | +------+----------+----------+ | 1 | 8000000 | 8000000 | +------+----------+----------+ zookage@client-node-0:~$ hdfs dfs -ls /user/hive/warehouse/map_join_tb Found 32 items -rw-r--r-- 3 zookage hive 1965281 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000000_0 -rw-r--r-- 3 zookage hive 1965275 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000001_0 -rw-r--r-- 3 zookage hive 1965275 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000002_0 ... -rw-r--r-- 3 zookage hive 1965279 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000030_0 -rw-r--r-- 3 zookage hive 1965279 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000031_0 zookage@client-node-0:~$ ``` # Submit a join query Bucket Map Join is chosen. If I remember correctly, we don't need `hive.optimize.bucketmapjoin` or `hive.enforce.bucketmapjoin` for Tez because `hive.convert.join.bucket.mapjoin.tez` is for Tez and it is true by default. ``` zookage@client-node-0:~$ beeline -e "explain select * from map_join_tb a join map_join_tb b on a.id=b.id;" Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage ... | Stage-1 | | Map 1 vectorized | | File Output Operator [FS_23] | | Map Join Operator [MAPJOIN_22] (rows=8800000 width=6) | | BucketMapJoin:true,Conds:SEL_21._col0=RS_19._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1"] | ... ``` I guess some preconditions are different. You may also see the logs around ConvertJoinMapJoin on HiveServer 2 but it is sometimes misleading in my experience. - You applied some patches to your Hive? - You use different configurations? - Stats could be invalid? By the way, on my machine, Hive 3.1.2 with Bucket Map Join generated incorrect rows. I have not taken a look at the reason(and it might be possible that the root cause exists in my environment). ``` zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*) from map_join_tb a join map_join_tb b on a.id=b.id;" Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage ... ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 3 .......... container SUCCEEDED 3 3 0 0 0 0 Map 1 .......... container SUCCEEDED 32 32 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 17.36 s ---------------------------------------------------------------------------------------------- ... +------+----------+---------+ | _c0 | _c1 | _c2 | +------+----------+---------+ | 46 | 7999987 | 249776 | +------+----------+---------+ ... zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*) from map_join_tb a join map_join_tb b on a.id=b.id;" --hiveconf hive.convert.join.bucket.mapjoin.tez=false Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage ... ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 3 3 0 0 0 0 Map 4 .......... container SUCCEEDED 3 3 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 Reducer 3 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 04/04 [==========================>>] 100% ELAPSED TIME: 29.79 s ---------------------------------------------------------------------------------------------- ... +------+----------+----------+ | _c0 | _c1 | _c2 | +------+----------+----------+ | 1 | 8000000 | 8000000 | +------+----------+----------+ ``` To be honest, I am not so familiar with this feature. Another expert might give better insights to you. Thanks. On Sun, Jun 25, 2023 at 8:43 PM smart li <smartli666...@gmail.com> wrote: > Hello Hive Users, > > I’m currently trying to understand how Bucket Map Join works in Hive, but > I’m encountering some issues that I need help with. Here’s what I did: > > Firstly, I created a Hive table using the following statement: > > create table map_join_tb( > id int > ) > clustered by (id) into 32 buckets; > > Then, I inserted 8 million rows of data into the table, with the ‘id’ > field ranging from 1 to 8 million. After the data was bucketed, each bucket > was approximately 2MB in size. > > I then set the following bucket map join configurations: > > set hive.optimize.bucketmapjoin=true; > set hive.enforce.bucketmapjoin=true; > > Lastly, I ran an EXPLAIN on the following SQL: > > explain select * from map_join_tb a join map_join_tb b on a.id=b.id; > > Surprisingly, it seems that a Reduce Join, not a Bucket Map Join, was > being performed according to the plan. I’m not sure why this is happening. > Under what conditions does Hive decide to perform a Bucket Map Join? > > For reference, I am using Hive version 3.1.2. Any help or insights into > this would be greatly appreciated. Thank you very much in advance. > > Best Regards, > [smartli] >