Re: Questions Regarding Bucket Map Join in Hive

Okumin Sun, 25 Jun 2023 11:22:55 -0700

Hi smart li,

As far as I tried with Hive 3.1.2 on Tez, Bucket Map Join was probably
triggered. My configurations could be different from yours, though.


# How I tested

## hive-site.xml

https://github.com/zookage/zookage/blob/v0.2.3/kubernetes/base/common/config/hive/hive-site.xml

## Prepare test data

```

zookage@client-node-0:~$ cat generate.sh

#!/bin/bash


for i in {1..8000000}; do

  echo $i

done

zookage@client-node-0:~$ bash generate.sh > /tmp/ids.csv

zookage@client-node-0:~$ hdfs dfs -copyFromLocal /tmp/ids.csv /tmp/ids.csv

zookage@client-node-0:~$ hdfs dfs -tail hdfs:///tmp/ids.csv | tail -n 3

7999998

7999999

8000000

zookage@client-node-0:~$
```

## Load test data

```

zookage@client-node-0:~$ beeline -e "

> create table map_join_tb(id int)

> clustered by (id) into 32 buckets;

> load data inpath '/tmp/ids.csv' into table map_join_tb;

> "

Connecting to
jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage

Connected to: Apache Hive (version 3.1.2)

...

zookage@client-node-0:~$ beeline -e "select min(id), max(id), count(*) from
map_join_tb"

Connecting to
jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage

...

+------+----------+----------+

| _c0  |   _c1    |   _c2    |

+------+----------+----------+

| 1    | 8000000  | 8000000  |

+------+----------+----------+

zookage@client-node-0:~$ hdfs dfs -ls /user/hive/warehouse/map_join_tb

Found 32 items

-rw-r--r--   3 zookage hive    1965281 2023-06-25 16:55
/user/hive/warehouse/map_join_tb/000000_0

-rw-r--r--   3 zookage hive    1965275 2023-06-25 16:55
/user/hive/warehouse/map_join_tb/000001_0

-rw-r--r--   3 zookage hive    1965275 2023-06-25 16:55
/user/hive/warehouse/map_join_tb/000002_0

...

-rw-r--r--   3 zookage hive    1965279 2023-06-25 16:55
/user/hive/warehouse/map_join_tb/000030_0

-rw-r--r--   3 zookage hive    1965279 2023-06-25 16:55
/user/hive/warehouse/map_join_tb/000031_0

zookage@client-node-0:~$
```
# Submit a join query

Bucket Map Join is chosen. If I remember correctly, we don't need
`hive.optimize.bucketmapjoin` or `hive.enforce.bucketmapjoin` for Tez
because `hive.convert.join.bucket.mapjoin.tez` is for Tez and it is true by
default.

```

zookage@client-node-0:~$ beeline -e "explain select * from map_join_tb a
join map_join_tb b on a.id=b.id;"

Connecting to
jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage

...

|     Stage-1                                        |

|       Map 1 vectorized                             |

|       File Output Operator [FS_23]                 |

|         Map Join Operator [MAPJOIN_22] (rows=8800000 width=6) |

|           
BucketMapJoin:true,Conds:SEL_21._col0=RS_19._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1"]
|

...


```

I guess some preconditions are different. You may also see the logs around
ConvertJoinMapJoin on HiveServer 2 but it is sometimes misleading in my
experience.

- You applied some patches to your Hive?
- You use different configurations?
- Stats could be invalid?

By the way, on my machine, Hive 3.1.2 with Bucket Map Join generated
incorrect rows. I have not taken a look at the reason(and it might be
possible that the root cause exists in my environment).

```

zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*)
from map_join_tb a join map_join_tb b on a.id=b.id;"

Connecting to
jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage

...

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING
 PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 3 .......... container     SUCCEEDED      3          3        0
 0       0       0
Map 1 .......... container     SUCCEEDED     32         32        0
 0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0
 0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 17.36 s

----------------------------------------------------------------------------------------------
...
+------+----------+---------+
| _c0  |   _c1    |   _c2   |
+------+----------+---------+
| 46   | 7999987  | 249776  |
+------+----------+---------+

...

zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*)
from map_join_tb a join map_join_tb b on a.id=b.id;" --hiveconf
hive.convert.join.bucket.mapjoin.tez=false

Connecting to
jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage

...

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING
 PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      3          3        0
 0       0       0
Map 4 .......... container     SUCCEEDED      3          3        0
 0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0
 0       0       0
Reducer 3 ...... container     SUCCEEDED      1          1        0
 0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 29.79 s

----------------------------------------------------------------------------------------------
...
+------+----------+----------+
| _c0  |   _c1    |   _c2    |
+------+----------+----------+
| 1    | 8000000  | 8000000  |
+------+----------+----------+

```

To be honest, I am not so familiar with this feature. Another expert might
give better insights to you.

Thanks.


On Sun, Jun 25, 2023 at 8:43 PM smart li <smartli666...@gmail.com> wrote:

> Hello Hive Users,
>
> I’m currently trying to understand how Bucket Map Join works in Hive, but
> I’m encountering some issues that I need help with. Here’s what I did:
>
> Firstly, I created a Hive table using the following statement:
>
> create table map_join_tb(
> id int
> )
> clustered by (id) into 32 buckets;
>
> Then, I inserted 8 million rows of data into the table, with the ‘id’
> field ranging from 1 to 8 million. After the data was bucketed, each bucket
> was approximately 2MB in size.
>
> I then set the following bucket map join configurations:
>
> set hive.optimize.bucketmapjoin=true;
> set hive.enforce.bucketmapjoin=true;
>
> Lastly, I ran an EXPLAIN on the following SQL:
>
> explain select * from map_join_tb a join map_join_tb b on a.id=b.id;
>
> Surprisingly, it seems that a Reduce Join, not a Bucket Map Join, was
> being performed according to the plan. I’m not sure why this is happening.
> Under what conditions does Hive decide to perform a Bucket Map Join?
>
> For reference, I am using Hive version 3.1.2. Any help or insights into
> this would be greatly appreciated. Thank you very much in advance.
>
> Best Regards,
> [smartli]
>

Re: Questions Regarding Bucket Map Join in Hive

Reply via email to