Hi all, I was trying out some hive optimization features and encountered such problem: I cannot use bucket map join in hive 0.12. After all the setting I tried below, only one hashtable file is generated and the join turn out to be just map join.
I have two tables both in rcfile format and both bucktized into 10 bucket, they are created and populated as follows(Origin data was generated from TPC-H): hive> create table lsm (l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double, l_extendedprice double, l_discount double, l_tax double, l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string, l_shipstruct string, l_shipmode string, l_comment string) clustered by (l_orderkey) into 10 buckets stored as rcfile; hive> create table osm (o_orderkey int, o_custkey int) clustered by (o_orderkey) into 10 buckets stored as rcfile; hive> set hive.enforce.bucketing=true; hive> insert overwrite table lsm select * from orili; hive> insert overwrite table osm select o_orderkey, o_custkey from orior; And I can query both table’s data normally, and lsm is 790MB, osm is 11MB, then I want to try bucket map join: hive> set hive.auto.convert.join=true; hive> set hive.optimize.bucketmapjoin=true; hive> set hive.enforce.bucketmapjoin=true; hive> set hive.auto.convert.join.noconditionaltask=true; hive> set hive.auto.convert.join.noconditionaltask.size=1000000000000000; hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; and my query is as follows: hive> select /*+ Mapjoin(osm) */ osm.o_orderkey, lsm.* from osm join lsm on osm.o_orderkey = lsm.l_orderkey; This query only generate 1 hashtable of osm and fall back to a map join, I was really confused about it. Do I have all the hint set to enable the bucket map join feature, or are there any problems in my query ? Thanks, Henry