[ https://issues.apache.org/jira/browse/HIVE-27267?focusedWorklogId=860806&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-860806 ]
ASF GitHub Bot logged work on HIVE-27267:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/May/23 17:19
            Start Date: 05/May/23 17:19
    Worklog Time Spent: 10m
      Work Description: sonarcloud[bot] commented on PR #4296:
URL: https://github.com/apache/hive/pull/4296#issuecomment-1536555414

   Kudos, SonarCloud Quality Gate passed! [Quality Gate](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=4296)

   [0 Bugs](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=4296&resolved=false&types=BUG)
   [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=4296&resolved=false&types=VULNERABILITY)
   [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=4296&resolved=false&types=SECURITY_HOTSPOT)
   [4 Code Smells](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=4296&resolved=false&types=CODE_SMELL)

   [No Coverage information](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=4296&metric=coverage&view=list)
   [No Duplication information](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=4296&metric=duplicated_lines_density&view=list)

Issue Time Tracking
-------------------
    Worklog Id:     (was: 860806)
    Time Spent: 20m  (was: 10m)

> Incorrect results when doing bucket map join on decimal bucketed column with subquery
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-27267
>                 URL: https://issues.apache.org/jira/browse/HIVE-27267
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sourabh Badhya
>            Assignee: Seonggon Namgung
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The following queries, when run on a Hive cluster, produce no results.
> Repro queries:
> {code:java}
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.support.concurrency=true;
> set hive.convert.join.bucket.mapjoin.tez=true;
>
> drop table if exists test_external_source;
> create external table test_external_source (date_col date, string_col string, decimal_col decimal(38,0))
>   stored as orc tblproperties ('external.table.purge'='true');
> insert into table test_external_source values
>   ('2022-08-30', 'pipeline', '50000000000000000005905545593'),
>   ('2022-08-16', 'pipeline', '50000000000000000005905545593'),
>   ('2022-09-01', 'pipeline', '50000000000000000006008686831'),
>   ('2022-08-30', 'pipeline', '50000000000000000005992620837'),
>   ('2022-09-01', 'pipeline', '50000000000000000005992620837'),
>   ('2022-09-01', 'pipeline', '50000000000000000005992621067'),
>   ('2022-08-30', 'pipeline', '50000000000000000005992621067');
>
> drop table if exists test_external_target;
> create external table test_external_target (date_col date, string_col string, decimal_col decimal(38,0))
>   stored as orc tblproperties ('external.table.purge'='true');
> insert into table test_external_target values
>   ('2017-05-17', 'pipeline', '50000000000000000000441610525'),
>   ('2018-12-20', 'pipeline', '50000000000000000001048981030'),
>   ('2020-06-30', 'pipeline', '50000000000000000002332575516'),
>   ('2021-08-16', 'pipeline', '50000000000000000003897973989'),
>   ('2017-06-06', 'pipeline', '50000000000000000000449148729'),
>   ('2017-09-08', 'pipeline', '50000000000000000000525378314'),
>   ('2022-08-30', 'pipeline', '50000000000000000005905545593'),
>   ('2022-08-16', 'pipeline', '50000000000000000005905545593'),
>   ('2018-05-03', 'pipeline', '50000000000000000000750826355'),
>   ('2020-01-10', 'pipeline', '50000000000000000001816579677'),
>   ('2021-11-01', 'pipeline', '50000000000000000004269423714'),
>   ('2017-11-07', 'pipeline', '50000000000000000000585901787'),
>   ('2019-10-15', 'pipeline', '50000000000000000001598843430'),
>   ('2020-04-01', 'pipeline', '50000000000000000002035795461'),
>   ('2020-02-24', 'pipeline', '50000000000000000001932600185'),
>   ('2020-04-27', 'pipeline', '50000000000000000002108160849'),
>   ('2016-07-05', 'pipeline', '50000000000000000000054405114'),
>   ('2020-06-02', 'pipeline', '50000000000000000002234387967'),
>   ('2020-08-21', 'pipeline', '50000000000000000002529168758'),
>   ('2021-02-17', 'pipeline', '50000000000000000003158511687');
>
> drop table if exists target_table;
> drop table if exists source_table;
> create table target_table(date_col date, string_col string, decimal_col decimal(38,0))
>   clustered by (decimal_col) into 7 buckets stored as orc
>   tblproperties ('bucketing_version'='2', 'transactional'='true', 'transactional_properties'='default');
> create table source_table(date_col date, string_col string, decimal_col decimal(38,0))
>   clustered by (decimal_col) into 7 buckets stored as orc
>   tblproperties ('bucketing_version'='2', 'transactional'='true', 'transactional_properties'='default');
> insert into table target_table select * from test_external_target;
> insert into table source_table select * from test_external_source;
> {code}
> Query under investigation:
> {code:java}
> select *
> from target_table
> inner join (
>   select distinct date_col, 'pipeline' string_col, decimal_col
>   from source_table
>   where coalesce(decimal_col,'') = '50000000000000000005905545593'
> ) s on s.date_col = target_table.date_col
>    and s.string_col = target_table.string_col
>    and s.decimal_col = target_table.decimal_col;
> {code}
> Expected result of the query - 2 records:
> {code:java}
> +-----------------------+-------------------------+-------------------------------+------------+--------------+-------------------------------+
> | target_table.date_col | target_table.string_col | target_table.decimal_col      | s.date_col | s.string_col | s.decimal_col                 |
> +-----------------------+-------------------------+-------------------------------+------------+--------------+-------------------------------+
> | 2022-08-16            | pipeline                | 50000000000000000005905545593 | 2022-08-16 | pipeline     | 50000000000000000005905545593 |
> | 2022-08-30            | pipeline                | 50000000000000000005905545593 | 2022-08-30 | pipeline     | 50000000000000000005905545593 |
> +-----------------------+-------------------------+-------------------------------+------------+--------------+-------------------------------+
> {code}
> Actual result of the query - no records:
> {code:java}
> +-----------------------+-------------------------+--------------------------+------------+--------------+---------------+
> | target_table.date_col | target_table.string_col | target_table.decimal_col | s.date_col | s.string_col | s.decimal_col |
> +-----------------------+-------------------------+--------------------------+------------+--------------+---------------+
> +-----------------------+-------------------------+--------------------------+------------+--------------+---------------+
> {code}
> The workaround that fetches the correct result is to set the config below to false:
> {code:java}
> set hive.convert.join.bucket.mapjoin.tez=false;
> {code}
> Notes from the investigation:
> 1. The batch containing the 2 expected rows is forwarded correctly to the map join operator. However, during the join comparison, the hash table is empty.
> 2. The problem seems to be that, although each HashTableDummyOperator instance loads its records into a hash table, the map join operator does not take into account the hash tables from all instances of HashTableDummyOperator (multiple map tasks are initiated by the bucket map join). Instead, it uses only the hash table from one of the HashTableDummyOperator instances. In this case, the selected instance had an empty hash table, so no records were matched in the join operator.
> 3. If the table is unbucketed or has a single bucket, the results are correct: only one map task is spawned, and it loads all the records into the hash table. The workaround (setting *hive.convert.join.bucket.mapjoin.tez* to {*}false{*}) has the same effect, since a single map task then loads all the records into the hash table.
> 4. HashTableDummyOperator is created in the optimizer and is associated with the plan, so some issue in the optimizer code is suspected. Ideally, the map join operator should use the hash tables from all instances of HashTableDummyOperator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
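The failure mode described in notes 2 and 3 can be sketched with a toy model. This is plain Python, not Hive internals: `bucket_id` is a stand-in for Hive's bucketing hash, and all function names are hypothetical. The point is only to show how probing a single per-bucket hash table, instead of the table for the bucket the key actually hashes to, can silently produce an empty join result.

```python
# Toy model of a bucket map join where the small side is split into
# per-bucket hash tables (one per map task), as in notes 2-3 above.
NUM_BUCKETS = 7

def bucket_id(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in for Hive's bucketing hash; any deterministic hash works here.
    return sum(ord(c) for c in key) % num_buckets

def build_per_bucket_tables(small_side_rows):
    """One hash table per bucket, as the bucket map join's map tasks build them."""
    tables = [dict() for _ in range(NUM_BUCKETS)]
    for key, row in small_side_rows:
        tables[bucket_id(key)].setdefault(key, []).append(row)
    return tables

def correct_join(big_rows, tables):
    # Correct behavior: probe the hash table of the bucket the key belongs to.
    return [(k, r, m) for k, r in big_rows
            for m in tables[bucket_id(k)].get(k, [])]

def buggy_join(big_rows, tables, chosen=0):
    # Buggy behavior: every probe uses one instance's hash table,
    # regardless of which bucket the key hashes to.
    return [(k, r, m) for k, r in big_rows
            for m in tables[chosen].get(k, [])]

small = [("50000000000000000005905545593", "src")]
big = [("50000000000000000005905545593", "tgt")]
tables = build_per_bucket_tables(small)

print(len(correct_join(big, tables)))  # 1 match
# If the single chosen instance happens to hold an empty hash table,
# the join returns nothing, mirroring the empty result in the report:
empty_bucket = next(i for i, t in enumerate(tables) if not t)
print(len(buggy_join(big, tables, chosen=empty_bucket)))  # 0 matches
```

With only one bucket (or `hive.convert.join.bucket.mapjoin.tez=false`), there is a single hash table holding every small-side row, so the "chosen" table is always the complete one, which matches the observation in note 3.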