Hi, I understand you are running on MapReduce. I recommend using Tez unless you have a specific reason not to. Tez is the recommended engine, and I would guess more community members run Hive 3 on Tez, which means you are more likely to get answers when you run into trouble.
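If you want to give Tez a quick try without touching hive-site.xml, switching the engine per session is usually enough. A rough sketch, assuming Tez is already installed and configured on your cluster:

```
set hive.execution.engine=tez;
-- re-check the plan with the engine switched
explain select * from map_join_tb a join map_join_tb b on a.id=b.id;
```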
That said, to answer quickly: I did succeed in enabling Bucket Map Join with Hive on MR using the following settings. It looks like Bucket Map Join on MR requires the original plan to already contain a Map Join, so I added a MAPJOIN hint together with `hive.ignore.mapjoin.hint=false`. I also added `hive.cbo.enable=false` on a hunch; I'm not sure why my environment needed it.

```
$ beeline -e "explain extended select /*+ MAPJOIN(b) */ * from map_join_tb a join map_join_tb b on a.id=b.id;" --hiveconf hive.optimize.bucketmapjoin=true --hiveconf hive.enforce.bucketing=true --hiveconf hive.ignore.mapjoin.hint=false --hiveconf hive.cbo.enable=false
...
| Map Operator Tree:        |
|   TableScan               |
|     alias: a              |
...
|   Map Join Operator       |
...
|     BucketMapJoin: true   |
```

Regards,
Okumin

On Mon, Jun 26, 2023 at 8:30 AM smart li <smartli666...@gmail.com> wrote:

> Hello,
>
> First of all, I would like to express my gratitude for your responses and assistance. I’m currently encountering a scenario where my Hive is not choosing BucketMapJoin, and I wonder whether this is due to its underlying execution engine, which is MapReduce.
>
> In addition, I am operating in a test environment that uses three VMware virtual machines. The memory allocations for these machines are 4GB, 2GB, and 2GB, respectively. Of these, the 4GB virtual machine has been assigned two CPU cores, while the other two have a single CPU core each. I am unsure whether these factors could influence Hive’s decision to not select BucketMapJoin.
>
> Any guidance or insight on this matter would be greatly appreciated.
>
> Thanks
>
> Okumin <m...@okumin.com> wrote on Mon, Jun 26, 2023 at 02:23:
>
>> Hi smart li,
>>
>> As far as I tried with Hive 3.1.2 on Tez, Bucket Map Join was probably triggered. My configurations could be different from yours, though.
>>
>> # How I tested
>>
>> ## hive-site.xml
>>
>> https://github.com/zookage/zookage/blob/v0.2.3/kubernetes/base/common/config/hive/hive-site.xml
>>
>> ## Prepare test data
>>
>> ```
>> zookage@client-node-0:~$ cat generate.sh
>> #!/bin/bash
>>
>> for i in {1..8000000}; do
>>   echo $i
>> done
>> zookage@client-node-0:~$ bash generate.sh > /tmp/ids.csv
>> zookage@client-node-0:~$ hdfs dfs -copyFromLocal /tmp/ids.csv /tmp/ids.csv
>> zookage@client-node-0:~$ hdfs dfs -tail hdfs:///tmp/ids.csv | tail -n 3
>> 7999998
>> 7999999
>> 8000000
>> zookage@client-node-0:~$
>> ```
>>
>> ## Load test data
>>
>> ```
>> zookage@client-node-0:~$ beeline -e "
>> > create table map_join_tb(id int)
>> > clustered by (id) into 32 buckets;
>> > load data inpath '/tmp/ids.csv' into table map_join_tb;
>> > "
>> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>> Connected to: Apache Hive (version 3.1.2)
>> ...
>> zookage@client-node-0:~$ beeline -e "select min(id), max(id), count(*) from map_join_tb"
>> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>> ...
>> +------+----------+----------+
>> | _c0  | _c1      | _c2      |
>> +------+----------+----------+
>> | 1    | 8000000  | 8000000  |
>> +------+----------+----------+
>> zookage@client-node-0:~$ hdfs dfs -ls /user/hive/warehouse/map_join_tb
>> Found 32 items
>> -rw-r--r--   3 zookage hive   1965281 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000000_0
>> -rw-r--r--   3 zookage hive   1965275 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000001_0
>> -rw-r--r--   3 zookage hive   1965275 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000002_0
>> ...
>> -rw-r--r--   3 zookage hive   1965279 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000030_0
>> -rw-r--r--   3 zookage hive   1965279 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000031_0
>> zookage@client-node-0:~$
>> ```
>>
>> # Submit a join query
>>
>> Bucket Map Join is chosen. If I remember correctly, we don't need `hive.optimize.bucketmapjoin` or `hive.enforce.bucketmapjoin` for Tez, because `hive.convert.join.bucket.mapjoin.tez` is the Tez-specific option and it is true by default.
>>
>> ```
>> zookage@client-node-0:~$ beeline -e "explain select * from map_join_tb a join map_join_tb b on a.id=b.id;"
>> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>> ...
>> | Stage-1 |
>> |   Map 1 vectorized |
>> |     File Output Operator [FS_23] |
>> |       Map Join Operator [MAPJOIN_22] (rows=8800000 width=6) |
>> |         BucketMapJoin:true,Conds:SEL_21._col0=RS_19._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1"] |
>> ...
>> ```
>>
>> I guess some preconditions are different. You may also check the logs around ConvertJoinMapJoin on HiveServer2, but they are sometimes misleading in my experience.
>>
>> - Did you apply any patches to your Hive?
>> - Do you use different configurations?
>> - Could the stats be invalid?
>>
>> By the way, on my machine, Hive 3.1.2 with Bucket Map Join generated incorrect rows. I have not looked into the reason yet (and it is possible that the root cause lies in my environment).
>>
>> ```
>> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*) from map_join_tb a join map_join_tb b on a.id=b.id;"
>> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>> ...
>> ----------------------------------------------------------------------------------------------
>>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>> ----------------------------------------------------------------------------------------------
>> Map 3 ..........      container     SUCCEEDED      3          3        0        0       0       0
>> Map 1 ..........      container     SUCCEEDED     32         32        0        0       0       0
>> Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
>> ----------------------------------------------------------------------------------------------
>> VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 17.36 s
>> ----------------------------------------------------------------------------------------------
>> ...
>> +------+----------+---------+
>> | _c0  | _c1      | _c2     |
>> +------+----------+---------+
>> | 46   | 7999987  | 249776  |
>> +------+----------+---------+
>> ...
>> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*) from map_join_tb a join map_join_tb b on a.id=b.id;" --hiveconf hive.convert.join.bucket.mapjoin.tez=false
>> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>> ...
>> ----------------------------------------------------------------------------------------------
>>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>> ----------------------------------------------------------------------------------------------
>> Map 1 ..........      container     SUCCEEDED      3          3        0        0       0       0
>> Map 4 ..........      container     SUCCEEDED      3          3        0        0       0       0
>> Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
>> Reducer 3 ......      container     SUCCEEDED      1          1        0        0       0       0
>> ----------------------------------------------------------------------------------------------
>> VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 29.79 s
>> ----------------------------------------------------------------------------------------------
>> ...
>> +------+----------+----------+
>> | _c0  | _c1      | _c2      |
>> +------+----------+----------+
>> | 1    | 8000000  | 8000000  |
>> +------+----------+----------+
>> ```
>>
>> To be honest, I am not so familiar with this feature. Another expert might give better insights to you.
>>
>> Thanks.
>>
>> On Sun, Jun 25, 2023 at 8:43 PM smart li <smartli666...@gmail.com> wrote:
>>
>>> Hello Hive Users,
>>>
>>> I’m currently trying to understand how Bucket Map Join works in Hive, but I’m encountering some issues that I need help with. Here’s what I did:
>>>
>>> Firstly, I created a Hive table using the following statement:
>>>
>>> create table map_join_tb(
>>>   id int
>>> )
>>> clustered by (id) into 32 buckets;
>>>
>>> Then, I inserted 8 million rows of data into the table, with the ‘id’ field ranging from 1 to 8 million. After the data was bucketed, each bucket was approximately 2MB in size.
>>>
>>> I then set the following bucket map join configurations:
>>>
>>> set hive.optimize.bucketmapjoin=true;
>>> set hive.enforce.bucketmapjoin=true;
>>>
>>> Lastly, I ran an EXPLAIN on the following SQL:
>>>
>>> explain select * from map_join_tb a join map_join_tb b on a.id=b.id;
>>>
>>> Surprisingly, it seems that a Reduce Join, not a Bucket Map Join, was being performed according to the plan. I’m not sure why this is happening. Under what conditions does Hive decide to perform a Bucket Map Join?
>>>
>>> For reference, I am using Hive version 3.1.2. Any help or insights into this would be greatly appreciated. Thank you very much in advance.
>>>
>>> Best Regards,
>>> [smartli]
>>
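P.S. On the "could the stats be invalid" point above: it may be worth recomputing the table statistics before looking at the plan again, since the optimizer relies on them when deciding whether a table is small enough for a map join. A rough sketch of what I would try; I have not verified that this changes the plan in your environment:

```
analyze table map_join_tb compute statistics;
analyze table map_join_tb compute statistics for columns;
-- check numRows / rawDataSize / totalSize in the table parameters
describe formatted map_join_tb;
```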