[jira] [Work logged] (HIVE-25822) Incorrect False positive result rows may be outputted in case outer join has conditions only affecting one side

ASF GitHub Bot (Jira) Mon, 20 Dec 2021 01:49:04 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-25822?focusedWorklogId=698522&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-698522
 ]


ASF GitHub Bot logged work on HIVE-25822:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Dec/21 09:48
            Start Date: 20/Dec/21 09:48
    Worklog Time Spent: 10m 
      Work Description: zabetak commented on a change in pull request #2891:
URL: https://github.com/apache/hive/pull/2891#discussion_r772195126



##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/CommonMergeJoinOperator.java
##########
@@ -92,6 +94,7 @@
 
   // A field because we cannot multi-inherit.
   transient InterruptibleProcessing interruptChecker;
+  transient boolean shortcutExtensionRows;

Review comment:
       Minor: should we use `unmatched` everywhere we use `extension`?
   
   "Unmatched rows of an outer join" sounds more natural than "Extension rows 
of an outer join".

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -1836,6 +1836,8 @@ private static void populateLlapDaemonVarsSet(Set<String> 
llapDaemonVarsSetLocal
     HIVEALIAS("hive.alias", "", ""),
     HIVEMAPSIDEAGGREGATE("hive.map.aggr", true, "Whether to use map-side 
aggregation in Hive Group By queries"),
     HIVEGROUPBYSKEW("hive.groupby.skewindata", false, "Whether there is skew 
in data to optimize group by queries"),
+    HIVE_JOIN_SHORTCUT_EXTENSIONROWS("hive.join.shortcut.extensionrows", true,
+        "Enables to shortcut processing of known filtered rows in merge 
joins."),

Review comment:
       I see your point but I still think that it would be better to avoid 
toggles when it comes to correctness issues. I guess it is not the first time 
we are doing this so I am not gonna push back on this but it shouldn't be the 
norm. I am leaving the final judgement up to you.
   
   If it goes in I would add some extra information that it is for "internal 
use only" and changing the value can lead to incorrect results.

##########
File path: ql/src/test/results/clientpositive/llap/join_1to1.q.out
##########
@@ -597,22 +592,10 @@ NULL      NULL    NULL    10      10010   66
 NULL   NULL    NULL    25      10025   66
 NULL   NULL    NULL    30      10030   88
 NULL   NULL    NULL    35      10035   88
-NULL   NULL    NULL    40      10040   66
 NULL   NULL    NULL    40      10040   88
-NULL   NULL    NULL    40      10040   88
-NULL   NULL    NULL    50      10050   66
-NULL   NULL    NULL    50      10050   66
-NULL   NULL    NULL    50      10050   66
-NULL   NULL    NULL    50      10050   66
-NULL   NULL    NULL    50      10050   88
 NULL   NULL    NULL    50      10050   88
-NULL   NULL    NULL    50      10050   88
-NULL   NULL    NULL    70      10040   88
 NULL   NULL    NULL    70      10040   88
 NULL   NULL    NULL    70      10040   88

Review comment:
       I didn't verify that all the changes in this file are correct. Should I 
do it or you went over them already?

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/CommonMergeJoinOperator.java
##########
@@ -302,7 +322,47 @@ public void process(Object row, int tag) throws 
HiveException {
 
     assert !nextKeyGroup;
     candidateStorage[tag].addRow(value);
+  }
+
+  private void emitExtensionRow(int tag, List<Object> value) throws 
HiveException {
+    extensionStorage[tag].addRow(value);
+    for (byte i = 0; i < order.length; i++) {
+      if (i == tag) {
+        storage[i] = extensionStorage[i];
+      } else {
+        putDummyOrEmpty(i);
+      }
+    }
+    checkAndGenObject();
+    extensionStorage[tag].clearRows();
+  }
 
+  /**
+   * Decides if the actual row must be an extension row.
+   *
+   * Extension rows are those which are not part of the inner-join.
+   * May not correctly identify all extension rows - but will remove trivially 
filtered ones.

Review comment:
       Since you have examples that do not work, maybe it would be better to 
write "The current implementation, does not correctly identify all..." and put 
a reference to a follow up JIRA highlighting that there is an implementation 
limitation.

##########
File path: ql/src/test/results/clientpositive/llap/join_1to1.q.out
##########
@@ -373,9 +373,6 @@ NULL        NULL    NULL    25      10025   66
 NULL   NULL    NULL    30      10030   88
 NULL   NULL    NULL    35      10035   88
 NULL   NULL    NULL    40      10040   88
-NULL   NULL    NULL    50      10050   66
-NULL   NULL    NULL    50      10050   66
-NULL   NULL    NULL    50      10050   88

Review comment:
       I am not a big fan of the the `src` and similar tables but if people 
validated the results against another DBMS I can leave with it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 698522)
    Time Spent: 1h  (was: 50m)

> Incorrect False positive result rows may be outputted in case outer join has 
> conditions only affecting one side
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-25822
>                 URL: https://issues.apache.org/jira/browse/HIVE-25822
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> needed
> * outer join
> * on condition has at least one condition for one side of the join
> * in a single reducer:
> ** a right hand side only row outputted right before
> ** >=2 rows on LHS and 1 on RHS matching in the join keys but the first LHS 
> doesn't satisfies the filter condition
> ** second LHS row with good filter condition
> {code}
> with
> t_y as (select col1 as id,col2 as s from (VALUES(0,'a'),(1,'y')) as c),
> t_xy as (select col1 as id,col2 as s from (VALUES(1,'x'),(1,'y')) as c) 
> select * from t_xy l full outer join t_y r on (l.id=r.id and l.s='y');
> {code}
> null,null,1,y is a false positive result
> {code}
> +-------+-------+-------+-------+
> | l.id  |  l.s  | r.id  |  r.s  |
> +-------+-------+-------+-------+
> | NULL  | NULL  | 0     | a     |
> | 1     | x     | NULL  | NULL  |
> | NULL  | NULL  | 1     | y     |
> | 1     | y     | 1     | y     |
> +-------+-------+-------+-------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Work logged] (HIVE-25822) Incorrect False positive result rows may be outputted in case outer join has conditions only affecting one side

Reply via email to