[ https://issues.apache.org/jira/browse/HIVE-23438?focusedWorklogId=461355&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461355 ]
ASF GitHub Bot logged work on HIVE-23438: ----------------------------------------- Author: ASF GitHub Bot Created on: 21/Jul/20 00:33 Start Date: 21/Jul/20 00:33 Worklog Time Spent: 10m Work Description: github-actions[bot] closed pull request #1014: URL: https://github.com/apache/hive/pull/1014 ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking ------------------- Worklog Id: (was: 461355) Time Spent: 20m (was: 10m) > Missing Rows When Left Outer Join In N-way HybridGraceHashJoin > -------------------------------------------------------------- > > Key: HIVE-23438 > URL: https://issues.apache.org/jira/browse/HIVE-23438 > Project: Hive > Issue Type: Bug > Components: SQL, Tez > Affects Versions: 2.3.4 > Reporter: 范宜臻 > Assignee: 范宜臻 > Priority: Major > Labels: pull-request-available > Attachments: HIVE-23438.001.branch-2.3.patch, > HIVE-23438.branch-2.3.patch > > Time Spent: 20m > Remaining Estimate: 0h > > *Run Test in Patch File* > {code:java} > mvn test -Dtest=TestMiniTezCliDriver -Dqfile=hybridgrace_hashjoin_2.q{code} > *Manual Reproduce* > *STEP 1. Create test data(q_test_init_tez.sql)* > {code:java} > //create table src1 > CREATE TABLE src1 (key STRING COMMENT 'default', value STRING COMMENT > 'default') STORED AS TEXTFILE; > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv3.txt" INTO TABLE src1; > //create table src2 > CREATE TABLE src2(key STRING COMMENT 'default', value STRING COMMENT > 'default') STORED AS TEXTFILE; > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv11.txt" OVERWRITE INTO > TABLE src2; > //create table srcpart > CREATE TABLE srcpart (key STRING COMMENT 'default', value STRING COMMENT > 'default') > PARTITIONED BY (ds STRING, hr STRING) > STORED AS TEXTFILE; > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt" > OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-08", hr="11"); > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt" > OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-08", hr="12"); > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt" > OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-09", hr="11"); > LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt" > OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-09", hr="12");{code} > *STEP 2. Run query* > {code:java} > set hive.auto.convert.join=true; > set hive.auto.convert.join.noconditionaltask=true; > set hive.auto.convert.join.noconditionaltask.size=10000000; > set hive.cbo.enable=false; > set hive.mapjoin.hybridgrace.hashtable=true; > select * > from > ( > select key from src1 group by key > ) x > left join src2 z on x.key = z.key > join > ( > select key from srcpart y group by key > ) y on y.key = x.key; > {code} > *EXPECTED RESULT*** > > {code:java} > 128 NULL NULL 128 > 146 146 1val_1461 146 > 150 150 1val_1501 150 > 238 NULL NULL 238 > 369 NULL NULL 369 > 406 406 1val_4061 406 > 273 273 1val_2731 273 > 98 NULL NULL 98 > 213 213 1val_2131 213 > 255 NULL NULL 255 > 401 401 1val_4011 401 > 278 NULL NULL 278 > 66 66 11val_6611 66 > 224 NULL NULL 224 > 311 NULL NULL 311 > {code} > > *ACTUAL RESULT* > {code:java} > 128 NULL NULL 128 > 146 146 1val_1461 146 > 150 150 1val_1501 150 > 213 213 1val_2131 213 > 238 NULL NULL 238 > 273 273 1val_2731 273 > 369 NULL NULL 369 > 406 406 1val_4061 406 > 98 NULL NULL 98 > 401 401 1val_4011 401 > 66 66 11val_6611 66 > {code} > > *ROOT CAUSE* > src1 left join src2, src1 is big table and src2 is small table. Join result > between big table row and the corresponding hashtable maybe NO_MATCH state, > however, these NO_MATCH rows is needed because LEFT OUTER JOIN. > In addition, these big table rows will not spilled into matchfile related to > this hashtable on disk because only SPILL state can use `spillBigTableRow`. > Then, these big table rows will be spilled into matchfile in hashtables of > table `srcpart`(second small table) > Finally, when reProcessBigTable, big table rows in matchfile are only read > from `firstSmallTable`, some datum are missing. > > *WORKAROUND* > configure firstSmallTable in completeInitializationOp and only spill big > table row into firstSmallTable when spill matchfile. > -- This message was sent by Atlassian Jira (v8.3.4#803005)