Hi, I understand you are running on MapReduce. I recommend using Tez unless you have a specific reason not to. Tez is the recommended engine, and I would guess more community members run Hive 3 on Tez, which means you are more likely to get answers when you run into trouble.
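If you want to give Tez a quick try without touching hive-site.xml, switching the engine per session is usually enough. A rough sketch, assuming Tez is already installed and configured on your cluster:

```
set hive.execution.engine=tez;
-- re-check the plan with the engine switched
explain select * from map_join_tb a join map_join_tb b on a.id=b.id;
```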
That said, to answer quickly: I did succeed in enabling Bucket Map Join with Hive on MR using the following settings. It looks like Bucket Map Join on MR requires the original plan to already contain a Map Join, so I added a MAPJOIN hint together with `hive.ignore.mapjoin.hint=false`. I also added `hive.cbo.enable=false` on a hunch; I'm not sure why my environment needed it.

```
$ beeline -e "explain extended select /*+ MAPJOIN(b) */ * from map_join_tb a join map_join_tb b on a.id=b.id;" --hiveconf hive.optimize.bucketmapjoin=true --hiveconf hive.enforce.bucketing=true --hiveconf hive.ignore.mapjoin.hint=false --hiveconf hive.cbo.enable=false
...
| Map Operator Tree:        |
|   TableScan               |
|     alias: a              |
...
|   Map Join Operator       |
...
|     BucketMapJoin: true   |
```

Regards,
Okumin

On Mon, Jun 26, 2023 at 8:30 AM smart li <smartli666...@gmail.com> wrote:

> Hello,
>
> First of all, I would like to express my gratitude for your responses and assistance. I’m currently encountering a scenario where my Hive is not choosing BucketMapJoin, and I wonder whether this is due to its underlying execution engine, which is MapReduce.
>
> In addition, I am operating in a test environment that uses three VMware virtual machines. The memory allocations for these machines are 4GB, 2GB, and 2GB, respectively. Of these, the 4GB virtual machine has been assigned two CPU cores, while the other two have a single CPU core each. I am unsure whether these factors could influence Hive’s decision to not select BucketMapJoin.
>
> Any guidance or insight on this matter would be greatly appreciated.
>
> Thanks
>
> Okumin <m...@okumin.com> wrote on Mon, Jun 26, 2023 at 02:23:
>
>> Hi smart li,
>>
>> As far as I tried with Hive 3.1.2 on Tez, Bucket Map Join was probably triggered. My configurations could be different from yours, though.
>>
>> # How I tested
>>
>> ## hive-site.xml
>>
>> https://github.com/zookage/zookage/blob/v0.2.3/kubernetes/base/common/config/hive/hive-site.xml
>>
>> ## Prepare test data
>>
>> ```
>> zookage@client-node-0:~$ cat generate.sh
>> #!/bin/bash
>>
>> for i in {1..8000000}; do
>>   echo $i
>> done
>> zookage@client-node-0:~$ bash generate.sh > /tmp/ids.csv
>> zookage@client-node-0:~$ hdfs dfs -copyFromLocal /tmp/ids.csv /tmp/ids.csv
>> zookage@client-node-0:~$ hdfs dfs -tail hdfs:///tmp/ids.csv | tail -n 3
>> 7999998
>> 7999999
>> 8000000
>> zookage@client-node-0:~$
>> ```
>>
>> ## Load test data
>>
>> ```
>> zookage@client-node-0:~$ beeline -e "
>> > create table map_join_tb(id int)
>> > clustered by (id) into 32 buckets;
>> > load data inpath '/tmp/ids.csv' into table map_join_tb;
>> > "
>> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>> Connected to: Apache Hive (version 3.1.2)
>> ...
>> zookage@client-node-0:~$ beeline -e "select min(id), max(id), count(*) from map_join_tb"
>> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>> ...
>> +------+----------+----------+
>> | _c0  | _c1      | _c2      |
>> +------+----------+----------+
>> | 1    | 8000000  | 8000000  |
>> +------+----------+----------+
>> zookage@client-node-0:~$ hdfs dfs -ls /user/hive/warehouse/map_join_tb
>> Found 32 items
>> -rw-r--r--   3 zookage hive   1965281 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000000_0
>> -rw-r--r--   3 zookage hive   1965275 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000001_0
>> -rw-r--r--   3 zookage hive   1965275 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000002_0
>> ...
>> -rw-r--r--   3 zookage hive   1965279 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000030_0
>> -rw-r--r--   3 zookage hive   1965279 2023-06-25 16:55 /user/hive/warehouse/map_join_tb/000031_0
>> zookage@client-node-0:~$
>> ```
>>
>> # Submit a join query
>>
>> Bucket Map Join is chosen. If I remember correctly, we don't need `hive.optimize.bucketmapjoin` or `hive.enforce.bucketmapjoin` for Tez, because `hive.convert.join.bucket.mapjoin.tez` is the Tez-specific option and it is true by default.
>>
>> ```
>> zookage@client-node-0:~$ beeline -e "explain select * from map_join_tb a join map_join_tb b on a.id=b.id;"
>> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>> ...
>> | Stage-1 |
>> |   Map 1 vectorized |
>> |     File Output Operator [FS_23] |
>> |       Map Join Operator [MAPJOIN_22] (rows=8800000 width=6) |
>> |         BucketMapJoin:true,Conds:SEL_21._col0=RS_19._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1"] |
>> ...
>> ```
>>
>> I guess some preconditions are different. You may also check the logs around ConvertJoinMapJoin on HiveServer2, but they are sometimes misleading in my experience.
>>
>> - Did you apply any patches to your Hive?
>> - Do you use different configurations?
>> - Could the stats be invalid?
>>
>> By the way, on my machine, Hive 3.1.2 with Bucket Map Join generated incorrect rows. I have not looked into the reason yet (and it is possible that the root cause lies in my environment).
>>
>> ```
>> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*) from map_join_tb a join map_join_tb b on a.id=b.id;"
>> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>> ...
>> ----------------------------------------------------------------------------------------------
>>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>> ----------------------------------------------------------------------------------------------
>> Map 3 ..........      container     SUCCEEDED      3          3        0        0       0       0
>> Map 1 ..........      container     SUCCEEDED     32         32        0        0       0       0
>> Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
>> ----------------------------------------------------------------------------------------------
>> VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 17.36 s
>> ----------------------------------------------------------------------------------------------
>> ...
>> +------+----------+---------+
>> | _c0  | _c1      | _c2     |
>> +------+----------+---------+
>> | 46   | 7999987  | 249776  |
>> +------+----------+---------+
>> ...
>> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), count(*) from map_join_tb a join map_join_tb b on a.id=b.id;" --hiveconf hive.convert.join.bucket.mapjoin.tez=false
>> Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>> ...
>> ----------------------------------------------------------------------------------------------
>>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>> ----------------------------------------------------------------------------------------------
>> Map 1 ..........      container     SUCCEEDED      3          3        0        0       0       0
>> Map 4 ..........      container     SUCCEEDED      3          3        0        0       0       0
>> Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
>> Reducer 3 ......      container     SUCCEEDED      1          1        0        0       0       0
>> ----------------------------------------------------------------------------------------------
>> VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 29.79 s
>> ----------------------------------------------------------------------------------------------
>> ...
>> +------+----------+----------+
>> | _c0  | _c1      | _c2      |
>> +------+----------+----------+
>> | 1    | 8000000  | 8000000  |
>> +------+----------+----------+
>> ```
>>
>> To be honest, I am not so familiar with this feature. Another expert might give better insights to you.
>>
>> Thanks.
>>
>> On Sun, Jun 25, 2023 at 8:43 PM smart li <smartli666...@gmail.com> wrote:
>>
>>> Hello Hive Users,
>>>
>>> I’m currently trying to understand how Bucket Map Join works in Hive, but I’m encountering some issues that I need help with. Here’s what I did:
>>>
>>> Firstly, I created a Hive table using the following statement:
>>>
>>> create table map_join_tb(
>>>   id int
>>> )
>>> clustered by (id) into 32 buckets;
>>>
>>> Then, I inserted 8 million rows of data into the table, with the ‘id’ field ranging from 1 to 8 million. After the data was bucketed, each bucket was approximately 2MB in size.
>>>
>>> I then set the following bucket map join configurations:
>>>
>>> set hive.optimize.bucketmapjoin=true;
>>> set hive.enforce.bucketmapjoin=true;
>>>
>>> Lastly, I ran an EXPLAIN on the following SQL:
>>>
>>> explain select * from map_join_tb a join map_join_tb b on a.id=b.id;
>>>
>>> Surprisingly, it seems that a Reduce Join, not a Bucket Map Join, was being performed according to the plan. I’m not sure why this is happening. Under what conditions does Hive decide to perform a Bucket Map Join?
>>>
>>> For reference, I am using Hive version 3.1.2. Any help or insights into this would be greatly appreciated. Thank you very much in advance.
>>>
>>> Best Regards,
>>> [smartli]
>>
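P.S. On the "could the stats be invalid" point above: it may be worth recomputing the table statistics before looking at the plan again, since the optimizer relies on them when deciding whether a table is small enough for a map join. A rough sketch of what I would try; I have not verified that this changes the plan in your environment:

```
analyze table map_join_tb compute statistics;
analyze table map_join_tb compute statistics for columns;
-- check numRows / rawDataSize / totalSize in the table parameters
describe formatted map_join_tb;
```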