[ https://issues.apache.org/jira/browse/HIVE-27226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798483#comment-17798483 ]
Denys Kuzmenko commented on HIVE-27226: --------------------------------------- thanks [~seonggon]! Would be great if you could submit PR with the above change only against branch-4.0. > FullOuterJoin with filter expressions is not computed correctly > --------------------------------------------------------------- > > Key: HIVE-27226 > URL: https://issues.apache.org/jira/browse/HIVE-27226 > Project: Hive > Issue Type: Bug > Affects Versions: 4.0.0-beta-1 > Reporter: Seonggon Namgung > Priority: Critical > Labels: hive-4.0.0-must > > I tested many OuterJoin queries as an extension of HIVE-27138, and I found > that Hive returns incorrect result for a query containing FullOuterJoin with > filter expressions. In a nutshell, all JoinOperators that run on Tez engine > return incorrect result for OuterJoin queries, and one of the reason for > incorrect computation comes from CommonJoinOperator, which is the base of all > JoinOperators. I attached the queries and configuration that I used at the > bottom of the document. I am still inspecting this problems, and I will share > an update once when I find out another reason. Also any comments and opinions > would be appreciated. > First of all, I observed that current Hive ignores filter expressions > contained in MapJoinOperator. For example, the attached result of query1 > shows that MapJoinOperator performs inner join, not full outer join. This > problem stems from removal of filterMap. When converting JoinOperator to > MapJoinOperator, ConvertJoinMapJoin#convertJoinDynamicPartitionedHashJoin() > removes filterMap of MapJoinOperator. Because MapJoinOperator does not > evaluate filter expressions if filterMap is null, this change makes > MapJoinOperator ignore filter expressions and it always joins tables > regardless whether they satisfy filter expressions or not. To solve this > problem, I disable FullOuterMapJoinOptimization and apply path for > HIVE-27138, which prevents NPE. (The patch is available at the following > link: LINK.) The rest of this document uses this modified Hive, but most of > problems happen to current Hive, too. > The second problem I found is that Hive returns the same left-null or > right-null rows multiple time when it uses MapJoinOperator or > CommonMergeJoinOperator. This is caused by the logic of current > CommonJoinOperator. Both of the two JoinOperators joins tables in 2 steps. > First, they create RowContainers, each of which is a group of rows from one > table and has the same key. Second, they call > CommonJoinOperator#checkAndGenObject() with created RowContainers. This > method checks filterTag of each row in RowContainers and forwards joined row > if they meet all filter conditions. For OuterJoin, checkAndGenObject() > forwards non-matching rows if there is no matching row in RowContainer. The > problem happens when there are multiple RowContainer for the same key and > table. For example, suppose that there are two left RowContainers and one > right RowContainer. If none of the row in two left RowContainers satisfies > filter condition, then checkAndGenObject() will forward Left-Null row for > each right row. Because checkAndGenObject() is called with each left > RowContainer, there will be two duplicated Left-Null rows for every right row. > In the case of MapJoinOperator, it always creates singleton RowContainer for > big table. Therefore, it always produces duplicated non-matching rows. > CommonMergeJoinOperator also creates multiple RowContainer for big table, > whose size is hive.join.emit.interval. In the below experiment, I also set > hive.join.shortcut.unmatched.rows=false, and hive.exec.reducers.max=1 to > disable specialized algorithm for OuterJoin of 2 tables and force calling > checkAndGenObject() before all rows with the same keys are gathered. I didn't > observe this problem when using VectorMapJoinOperator, and I will inspect > VectorMapJoinOperator whether we can reproduce the problem with it. > I think the second problem is not limited to FullOuterJoin, but I couldn't > find such query as of now. This will also be added to this issue if I can > write a query that reproduces the second problem without FullOuterJoin. > I also found that Hive returns wrong result for query2 even when I used > VectorMapJoinOperator. I am still inspecting this problem and I will add an > update on it when I find out the reason. > > Experiment: > > {code:java} > ---- Configuration > set hive.optimize.shared.work=false; > -- Std MapJoin > set hive.auto.convert.join=true; > set hive.vectorized.execution.enabled=false; > -- Vec MapJoin > set hive.auto.convert.join=true; > set hive.vectorized.execution.enabled=true; > -- MergeJoin > set hive.auto.convert.join=false; > set hive.vectorized.execution.enabled=false; > set hive.join.shortcut.unmatched.rows=false; > set hive.join.emit.interval=1; > set hive.exec.reducers.max=1; > > ---- Queries > -- Query 1 > DROP TABLE IF EXISTS a; > CREATE TABLE a (key string, value string); > INSERT INTO a VALUES (1, 1), (1, 2), (2, 1); > SELECT * FROM a FULL OUTER JOIN a b ON a.key = b.key AND a.key < 0; > -- Query 2 > DROP TABLE IF EXISTS b; > CREATE TABLE b (key string, value string); > INSERT INTO b VALUES (1, 0), (1, 1); > SELECT * FROM b FULL OUTER JOIN b a ON a.key = b.key AND a.value > 0 AND > b.value > 0;{code} > > > Experiment result: > > {code:java} > -- PostgresSQL > -- Query1 > key | value | key | value > -----+-------+-----+------- > 1 | 1 | | > 1 | 2 | | > 2 | 1 | | > | | 1 | 2 > | | 1 | 1 > | | 2 | 1 > (6 rows) > -- Query2 > key | value | key | value > -----+-------+-----+------- > 1 | 0 | | > 1 | 1 | 1 | 1 > | | 1 | 0 > (3 rows){code} > {code:java} > -- Query1 Result, current Hive > -- Std MapJoin > +--------+----------+--------+----------+ > | a.key | a.value | b.key | b.value | > +--------+----------+--------+----------+ > | 2 | 1 | 2 | 1 | > | 1 | 2 | 1 | 2 | > | 1 | 2 | 1 | 1 | > | 1 | 1 | 1 | 2 | > | 1 | 1 | 1 | 1 | > +--------+----------+--------+----------+ > -- Vec MapJoin > +--------+----------+--------+----------+ > | a.key | a.value | b.key | b.value | > +--------+----------+--------+----------+ > | 1 | 2 | NULL | NULL | > | 1 | 1 | NULL | NULL | > | NULL | NULL | 1 | 2 | > | NULL | NULL | 1 | 1 | > | 2 | 1 | NULL | NULL | > | NULL | NULL | 2 | 1 | > +--------+----------+--------+----------+ > -- MergeJoin > +--------+----------+--------+----------+ > | a.key | a.value | b.key | b.value | > +--------+----------+--------+----------+ > | 1 | 1 | NULL | NULL | > | NULL | NULL | 1 | 1 | > | NULL | NULL | 1 | 2 | > | 1 | 2 | NULL | NULL | > | NULL | NULL | 1 | 1 | > | NULL | NULL | 1 | 2 | > | 2 | 1 | NULL | NULL | > | NULL | NULL | 2 | 1 | > +--------+----------+--------+----------+ {code} > {code:java} > -- Query1 Result, Hive with HIVE-27138 patch, disable > FullOuterMapJoinOptimization > -- Std MapJoin > +--------+----------+--------+----------+ > | a.key | a.value | b.key | b.value | > +--------+----------+--------+----------+ > | 2 | 1 | NULL | NULL | > | NULL | NULL | 2 | 1 | > | 1 | 2 | NULL | NULL | > | NULL | NULL | 1 | 2 | > | NULL | NULL | 1 | 1 | > | 1 | 1 | NULL | NULL | > | NULL | NULL | 1 | 2 | > | NULL | NULL | 1 | 1 | > +--------+----------+--------+----------+ > -- Vec MapJoin > +--------+----------+--------+----------+ > | a.key | a.value | b.key | b.value | > +--------+----------+--------+----------+ > | 1 | 2 | NULL | NULL | > | 1 | 1 | NULL | NULL | > | NULL | NULL | 1 | 2 | > | NULL | NULL | 1 | 1 | > | 2 | 1 | NULL | NULL | > | NULL | NULL | 2 | 1 | > +--------+----------+--------+----------+ > -- MergeJoin > +--------+----------+--------+----------+ > | a.key | a.value | b.key | b.value | > +--------+----------+--------+----------+ > | 1 | 1 | NULL | NULL | > | NULL | NULL | 1 | 1 | > | NULL | NULL | 1 | 2 | > | 1 | 2 | NULL | NULL | > | NULL | NULL | 1 | 1 | > | NULL | NULL | 1 | 2 | > | 2 | 1 | NULL | NULL | > | NULL | NULL | 2 | 1 | > +--------+----------+--------+----------+ {code} > {code:java} > -- Query2 Result, current Hive > -- Std MapJoin > +--------+----------+--------+----------+ > | b.key | b.value | a.key | a.value | > +--------+----------+--------+----------+ > | 1 | 0 | NULL | NULL | > | NULL | NULL | 1 | 0 | > | NULL | NULL | 1 | 1 | > | 1 | 1 | 1 | 0 | > | 1 | 1 | 1 | 1 | > +--------+----------+--------+----------+ > -- Vec MapJoin > +--------+----------+--------+----------+ > | b.key | b.value | a.key | a.value | > +--------+----------+--------+----------+ > | 1 | 0 | NULL | NULL | > | 1 | 1 | 1 | 1 | > | 1 | 1 | 1 | 0 | > +--------+----------+--------+----------+ > -- MergeJoin > +--------+----------+--------+----------+ > | b.key | b.value | a.key | a.value | > +--------+----------+--------+----------+ > | 1 | 0 | NULL | NULL | > | NULL | NULL | 1 | 0 | > | 1 | 1 | 1 | 1 | > +--------+----------+--------+----------+ {code} > {code:java} > -- Query2 Result, Hive with HIVE-27138 patch, disable > FullOuterMapJoinOptimization > -- Std MapJoin > +--------+----------+--------+----------+ > | b.key | b.value | a.key | a.value | > +--------+----------+--------+----------+ > | 1 | 1 | 1 | 1 | > | 1 | 1 | 1 | 0 | > | 1 | 0 | 1 | 1 | > | 1 | 0 | 1 | 0 | > +--------+----------+--------+----------+ > -- Vec MapJoin > +--------+----------+--------+----------+ > | b.key | b.value | a.key | a.value | > +--------+----------+--------+----------+ > | 1 | 0 | NULL | NULL | > | 1 | 1 | 1 | 0 | > | 1 | 1 | 1 | 1 | > +--------+----------+--------+----------+ > -- MergeJoin > +--------+----------+--------+----------+ > | b.key | b.value | a.key | a.value | > +--------+----------+--------+----------+ > | 1 | 0 | NULL | NULL | > | NULL | NULL | 1 | 0 | > | 1 | 1 | 1 | 1 | > +--------+----------+--------+----------+ {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)