Hello,

First of all, thank you for your responses and assistance. I'm currently encountering a scenario where Hive is not choosing a Bucket Map Join, and I wonder whether this is because my underlying execution engine is MapReduce.
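For reference, this is roughly how I inspect the relevant settings in my beeline session (a sketch, not verified output; I expect the engine to be "mr" and the two bucketmapjoin flags to be true in my environment):

```
-- In beeline / the Hive CLI, "set <property>;" with no value prints the current setting.
set hive.execution.engine;
set hive.optimize.bucketmapjoin;
set hive.enforce.bucketmapjoin;
set hive.auto.convert.join;
set hive.convert.join.bucket.mapjoin.tez;
```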
In addition, I am operating in a test environment that uses three VMware virtual machines. The memory allocations for these machines are 4GB, 2GB, and 2GB, respectively. Of these, the 4GB virtual machine has been assigned two CPU cores, while the other two have a single CPU core each. I am unsure whether these factors could influence Hive's decision not to select a Bucket Map Join. Any guidance or insight on this matter would be greatly appreciated.

Thanks

Okumin <m...@okumin.com> wrote on Mon, Jun 26, 2023 at 02:23:

> Hi smart li,
>
> As far as I tried with Hive 3.1.2 on Tez, Bucket Map Join was probably
> triggered. My configurations could be different from yours, though.
>
> # How I tested
>
> ## hive-site.xml
>
> https://github.com/zookage/zookage/blob/v0.2.3/kubernetes/base/common/config/hive/hive-site.xml
>
> ## Prepare test data
>
> ```
> zookage@client-node-0:~$ cat generate.sh
> #!/bin/bash
>
> for i in {1..8000000}; do
>   echo $i
> done
> zookage@client-node-0:~$ bash generate.sh > /tmp/ids.csv
> zookage@client-node-0:~$ hdfs dfs -copyFromLocal /tmp/ids.csv /tmp/ids.csv
> zookage@client-node-0:~$ hdfs dfs -tail hdfs:///tmp/ids.csv | tail -n 3
> 7999998
> 7999999
> 8000000
> zookage@client-node-0:~$
> ```
>
> ## Load test data
>
> ```
> zookage@client-node-0:~$ beeline -e "
> > create table map_join_tb(id int)
> > clustered by (id) into 32 buckets;
> > load data inpath '/tmp/ids.csv' into table map_join_tb;
> > "
> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
> Connected to: Apache Hive (version 3.1.2)
> ...
> zookage@client-node-0:~$ beeline -e "select min(id), max(id), count(*) from map_join_tb"
> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
> ...
> +------+----------+----------+
> | _c0  |   _c1    |   _c2    |
> +------+----------+----------+
> | 1    | 8000000  | 8000000  |
> +------+----------+----------+
> zookage@client-node-0:~$ hdfs dfs -ls /user/hive/warehouse/map_join_tb
> Found 32 items
> -rw-r--r--   3 zookage hive   1965281 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000000_0
> -rw-r--r--   3 zookage hive   1965275 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000001_0
> -rw-r--r--   3 zookage hive   1965275 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000002_0
> ...
> -rw-r--r--   3 zookage hive   1965279 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000030_0
> -rw-r--r--   3 zookage hive   1965279 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000031_0
> zookage@client-node-0:~$
> ```
>
> # Submit a join query
>
> Bucket Map Join is chosen. If I remember correctly, we don't need
> `hive.optimize.bucketmapjoin` or `hive.enforce.bucketmapjoin` for Tez,
> because `hive.convert.join.bucket.mapjoin.tez` is for Tez and it is true by
> default.
>
> ```
> zookage@client-node-0:~$ beeline -e "explain select * from map_join_tb a join map_join_tb b on a.id=b.id;"
> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
> ...
> | Stage-1                                                      |
> |   Map 1 vectorized                                           |
> |     File Output Operator [FS_23]                             |
> |       Map Join Operator [MAPJOIN_22] (rows=8800000 width=6)  |
> |         BucketMapJoin:true,Conds:SEL_21._col0=RS_19._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1"] |
> ...
> ```
>
> I guess some preconditions are different. You may also check the logs around
> ConvertJoinMapJoin on HiveServer2, but they are sometimes misleading in my
> experience.
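(As an aside on the Tez-specific flag mentioned above, a minimal sketch of inspecting and toggling it per session, assuming a beeline connection to the same Hive 3.1.2 on Tez as in the reply:)

```
-- Print the current value of the Tez-specific conversion flag.
set hive.convert.join.bucket.mapjoin.tez;

-- Turn it off for this session only; further down in the thread the same
-- thing is done with --hiveconf on the beeline command line.
set hive.convert.join.bucket.mapjoin.tez=false;

explain select * from map_join_tb a join map_join_tb b on a.id = b.id;
```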
>
> - Did you apply some patches to your Hive?
> - Do you use different configurations?
> - Could the stats be invalid?
>
> By the way, on my machine, Hive 3.1.2 with Bucket Map Join generated
> incorrect rows. I have not taken a look at the reason (and it might be
> possible that the root cause exists in my environment).
>
> ```
> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*) from map_join_tb a join map_join_tb b on a.id=b.id;"
> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
> ...
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 3 ..........      container     SUCCEEDED      3          3        0        0       0       0
> Map 1 ..........      container     SUCCEEDED     32         32        0        0       0       0
> Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 17.36 s
> ----------------------------------------------------------------------------------------------
> ...
> +------+----------+---------+
> | _c0  |   _c1    |   _c2   |
> +------+----------+---------+
> | 46   | 7999987  | 249776  |
> +------+----------+---------+
> ...
> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*) from map_join_tb a join map_join_tb b on a.id=b.id;" --hiveconf hive.convert.join.bucket.mapjoin.tez=false
> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
> ...
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 ..........      container     SUCCEEDED      3          3        0        0       0       0
> Map 4 ..........      container     SUCCEEDED      3          3        0        0       0       0
> Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
> Reducer 3 ......      container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 29.79 s
> ----------------------------------------------------------------------------------------------
> ...
> +------+----------+----------+
> | _c0  |   _c1    |   _c2    |
> +------+----------+----------+
> | 1    | 8000000  | 8000000  |
> +------+----------+----------+
> ```
>
> To be honest, I am not so familiar with this feature. Another expert might
> give you better insights.
>
> Thanks.
>
> On Sun, Jun 25, 2023 at 8:43 PM smart li <smartli666...@gmail.com> wrote:
>
>> Hello Hive Users,
>>
>> I'm currently trying to understand how Bucket Map Join works in Hive, but
>> I'm encountering some issues that I need help with. Here's what I did:
>>
>> Firstly, I created a Hive table using the following statement:
>>
>> create table map_join_tb(
>>   id int
>> )
>> clustered by (id) into 32 buckets;
>>
>> Then, I inserted 8 million rows of data into the table, with the 'id'
>> field ranging from 1 to 8 million. After the data was bucketed, each bucket
>> was approximately 2MB in size.
>>
>> I then set the following bucket map join configurations:
>>
>> set hive.optimize.bucketmapjoin=true;
>> set hive.enforce.bucketmapjoin=true;
>>
>> Lastly, I ran an EXPLAIN on the following SQL:
>>
>> explain select * from map_join_tb a join map_join_tb b on a.id=b.id;
>>
>> Surprisingly, it seems that a Reduce Join, not a Bucket Map Join, was
>> being performed according to the plan. I'm not sure why this is happening.
>> Under what conditions does Hive decide to perform a Bucket Map Join?
>>
>> For reference, I am using Hive version 3.1.2. Any help or insights into
>> this would be greatly appreciated. Thank you very much in advance.
>>
>> Best Regards,
>> [smartli]
>>
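A minimal sketch of one way to probe the question above on the MapReduce engine. It rests on the assumption (mine, not something confirmed in the thread) that on MapReduce the join must first qualify as a map join, for example via a MAPJOIN hint, before hive.optimize.bucketmapjoin can rewrite it into a bucket map join:

```
-- Hypothetical beeline session against Hive 3.1.2 on the MapReduce engine.
set hive.execution.engine=mr;
set hive.optimize.bucketmapjoin=true;
set hive.enforce.bucketmapjoin=true;

-- MAPJOIN hints are ignored by default in recent Hive versions; allow them so
-- the small side is built as a hash table, which the bucket optimization refines.
set hive.ignore.mapjoin.hint=false;

explain
select /*+ MAPJOIN(b) */ *
from map_join_tb a
join map_join_tb b on a.id = b.id;
```

If the resulting plan shows a Map Join Operator but still no BucketMapJoin, the preconditions mentioned in the replies above (stats, configurations, patches) seem the more likely cause.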