[ 
https://issues.apache.org/jira/browse/HIVE-27267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27267:
----------------------------------
    Description: 
The query under investigation below returns no results when run on a Hive cluster, even though 2 matching records exist - 
Repro queries - 
{code:sql}
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.support.concurrency=true;
set hive.convert.join.bucket.mapjoin.tez=true;

drop table if exists test_external_source;
create external table test_external_source (date_col date, string_col string, 
decimal_col decimal(38,0)) stored as orc tblproperties 
('external.table.purge'='true');
insert into table test_external_source values ('2022-08-30', 'pipeline', 
'50000000000000000005905545593'), ('2022-08-16', 'pipeline', 
'50000000000000000005905545593'), ('2022-09-01', 'pipeline', 
'50000000000000000006008686831'), ('2022-08-30', 'pipeline', 
'50000000000000000005992620837'), ('2022-09-01', 'pipeline', 
'50000000000000000005992620837'), ('2022-09-01', 'pipeline', 
'50000000000000000005992621067'), ('2022-08-30', 'pipeline', 
'50000000000000000005992621067');

drop table if exists test_external_target;
create external table test_external_target (date_col date, string_col string, 
decimal_col decimal(38,0)) stored as orc tblproperties 
('external.table.purge'='true');
insert into table test_external_target values ('2017-05-17', 'pipeline', 
'50000000000000000000441610525'), ('2018-12-20', 'pipeline', 
'50000000000000000001048981030'), ('2020-06-30', 'pipeline', 
'50000000000000000002332575516'), ('2021-08-16', 'pipeline', 
'50000000000000000003897973989'), ('2017-06-06', 'pipeline', 
'50000000000000000000449148729'), ('2017-09-08', 'pipeline', 
'50000000000000000000525378314'), ('2022-08-30', 'pipeline', 
'50000000000000000005905545593'), ('2022-08-16', 'pipeline', 
'50000000000000000005905545593'), ('2018-05-03', 'pipeline', 
'50000000000000000000750826355'), ('2020-01-10', 'pipeline', 
'50000000000000000001816579677'), ('2021-11-01', 'pipeline', 
'50000000000000000004269423714'), ('2017-11-07', 'pipeline', 
'50000000000000000000585901787'), ('2019-10-15', 'pipeline', 
'50000000000000000001598843430'), ('2020-04-01', 'pipeline', 
'50000000000000000002035795461'), ('2020-02-24', 'pipeline', 
'50000000000000000001932600185'), ('2020-04-27', 'pipeline', 
'50000000000000000002108160849'), ('2016-07-05', 'pipeline', 
'50000000000000000000054405114'), ('2020-06-02', 'pipeline', 
'50000000000000000002234387967'), ('2020-08-21', 'pipeline', 
'50000000000000000002529168758'), ('2021-02-17', 'pipeline', 
'50000000000000000003158511687');

drop table if exists target_table;
drop table if exists source_table;
create table target_table(date_col date, string_col string, decimal_col 
decimal(38,0)) clustered by (decimal_col) into 7 buckets stored as orc 
tblproperties ('bucketing_version'='2', 'transactional'='true', 
'transactional_properties'='default');
create table source_table(date_col date, string_col string, decimal_col 
decimal(38,0)) clustered by (decimal_col) into 7 buckets stored as orc 
tblproperties ('bucketing_version'='2', 'transactional'='true', 
'transactional_properties'='default');

insert into table target_table select * from test_external_target;
insert into table source_table select * from test_external_source; {code}
Query under investigation - 
{code:sql}
select * from target_table inner join (select distinct date_col, 'pipeline' 
string_col, decimal_col from source_table where coalesce(decimal_col,'') = 
'50000000000000000005905545593') s on s.date_col = target_table.date_col AND 
s.string_col = target_table.string_col AND s.decimal_col = 
target_table.decimal_col; {code}
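To confirm that the plan was actually converted to a bucket map join, the plan can be inspected as a diagnostic (a sketch; the exact annotation in the explain output, e.g. BucketMapJoin: true on the Map Join Operator, can vary across Hive versions):
{code:sql}
-- Inspect the plan for the bucket map join conversion.
explain
select * from target_table inner join (select distinct date_col, 'pipeline' string_col,
decimal_col from source_table where coalesce(decimal_col,'') = '50000000000000000005905545593') s
on s.date_col = target_table.date_col and s.string_col = target_table.string_col
and s.decimal_col = target_table.decimal_col;
{code}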
Expected result of the query - 2 records
{code}
+------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
| target_table.date_col  | target_table.string_col  |    target_table.decimal_col    | s.date_col  | s.string_col  |         s.decimal_col          |
+------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
| 2022-08-16             | pipeline                 | 50000000000000000005905545593  | 2022-08-16  | pipeline      | 50000000000000000005905545593  |
| 2022-08-30             | pipeline                 | 50000000000000000005905545593  | 2022-08-30  | pipeline      | 50000000000000000005905545593  |
+------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
{code}
Actual result of the query - No records
{code}
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
| target_table.date_col  | target_table.string_col  | target_table.decimal_col  | s.date_col  | s.string_col  | s.decimal_col  |
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
{code}
The workaround that produces the correct result is to set the following config to false - 
{code:sql}
set hive.convert.join.bucket.mapjoin.tez=false;{code}
Notes from investigation - 
1. The batch containing the 2 expected results is forwarded correctly to the map join operator. However, during the join comparison, the hash table is empty, so the records are lost in the join itself (see the standalone subquery check after this list).
2. The problem seems to be that HashTableDummyOperator loads the hash table with the records, but the map join operator does not take into account the hash tables from all instances of HashTableDummyOperator (there are multiple instances because bucket map join initiates multiple map tasks); instead it uses only the hash table from one of the HashTableDummyOperator instances. In this case, the selected instance had an empty hash table, hence no records were matched in the join operator.
3. If the table is unbucketed or has a single bucket, the results are correct: only 1 map task is spawned, and it loads all the records into the hash table. The workaround (setting *hive.convert.join.bucket.mapjoin.tez* to {*}false{*}) has the same effect, since a single map task then loads the records into the hash table. A single-bucket variant of the repro is sketched after this list.
4. HashTableDummyOperator is created in the optimizer and is associated with the plan, hence the suspicion that the issue lies in the optimizer code. Ideally, the hash tables from all instances of HashTableDummyOperator must be used by the map join operator.
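As a sanity check for note 1 (an illustrative diagnostic, not part of the original repro), the inner subquery can be run on its own; it returns the two expected rows, which confirms the records are dropped in the join rather than in the scan or filter:
{code:sql}
-- Diagnostic: the build side of the join, run standalone.
-- Per note 1 these rows do reach the map join operator, so the
-- two expected rows come back here.
select distinct date_col, 'pipeline' string_col, decimal_col
from source_table
where coalesce(decimal_col,'') = '50000000000000000005905545593';
{code}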
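And a minimal sketch of the single-bucket variant from note 3, assuming the same external staging tables are still present; the suffixed table names (target_table_1b, source_table_1b) are made up for illustration. With 1 bucket only one map task loads the hash table, and the join returns the expected 2 records:
{code:sql}
-- Hypothetical single-bucket repro variant (the *_1b names are illustrative).
drop table if exists target_table_1b;
drop table if exists source_table_1b;
create table target_table_1b (date_col date, string_col string, decimal_col decimal(38,0))
clustered by (decimal_col) into 1 buckets stored as orc
tblproperties ('bucketing_version'='2', 'transactional'='true', 'transactional_properties'='default');
create table source_table_1b (date_col date, string_col string, decimal_col decimal(38,0))
clustered by (decimal_col) into 1 buckets stored as orc
tblproperties ('bucketing_version'='2', 'transactional'='true', 'transactional_properties'='default');

insert into table target_table_1b select * from test_external_target;
insert into table source_table_1b select * from test_external_source;

-- With a single bucket only 1 map task is spawned, and the 2 expected records are returned.
select * from target_table_1b inner join (select distinct date_col, 'pipeline' string_col,
decimal_col from source_table_1b where coalesce(decimal_col,'') = '50000000000000000005905545593') s
on s.date_col = target_table_1b.date_col and s.string_col = target_table_1b.string_col
and s.decimal_col = target_table_1b.decimal_col;
{code}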

> Incorrect results when doing bucket map join on decimal bucketed column with 
> subquery
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-27267
>                 URL: https://issues.apache.org/jira/browse/HIVE-27267
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sourabh Badhya
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
