[jira] [Work logged] (HIVE-24817) "not in" clause returns incorrect data when there is coercion

ASF GitHub Bot (Jira) Tue, 02 Mar 2021 03:53:30 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-24817?focusedWorklogId=559871&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-559871
 ]


ASF GitHub Bot logged work on HIVE-24817:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/Mar/21 11:50
            Start Date: 02/Mar/21 11:50
    Worklog Time Spent: 10m 
      Work Description: kgyrtkirk commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r585391616



##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       I went back and forth on this, and I think there may be another way 
around this problem that would also retain the optimization:
   * introduce a new `NOT` operator that can be configured to return 
true or false for null values
   * in filter expressions, start using the new `NOT` operator and switch 
mode below every `NOT` operator
   
   However, this feels like a more complicated change; we should only do it if 
we would otherwise lose important optimizations.
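
One possible reading of the proposed operator, as a minimal sketch in plain Java rather than Hive code. The `UnknownAs` enum here is a local, hypothetical stand-in, `not` and `flip` are illustrative names, and a `null` `Boolean` stands for SQL UNKNOWN:

```java
public class ModalNotSketch {
  // Hypothetical mode: what an UNKNOWN operand should be treated as.
  enum UnknownAs { TRUE, FALSE }

  // A NOT whose result for an UNKNOWN (null) operand is chosen by the mode.
  // For TRUE/FALSE operands it behaves like ordinary negation.
  static boolean not(Boolean operand, UnknownAs mode) {
    if (operand == null) {
      return mode == UnknownAs.TRUE;
    }
    return !operand;
  }

  // The mode used for the operand of a NOT is the opposite of the mode above it.
  static UnknownAs flip(UnknownAs mode) {
    return mode == UnknownAs.TRUE ? UnknownAs.FALSE : UnknownAs.TRUE;
  }

  public static void main(String[] args) {
    Boolean inResult = null; // e.g. int_col IN (NULL) evaluates to UNKNOWN
    System.out.println(not(inResult, UnknownAs.TRUE));  // true: row is kept
    System.out.println(not(inResult, UnknownAs.FALSE)); // false: row is dropped
    System.out.println(flip(UnknownAs.FALSE));          // TRUE
  }
}
```

With the mode set to `UnknownAs.TRUE`, a `NOT` over an `IN` whose list was folded to null would keep the row, which matches what the un-optimized `NOT IN (355.8)` would return.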

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {
               TypeInfo targetType = exprFactory.getTypeInfo(valueDesc);
               if (!expressions.containsKey(targetType)) {

Review comment:
       This `if` statement has no effect: the map value will be overwritten 
anyway. I wonder if we have a bug here.
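
Whether the guard is redundant depends on the type of `expressions`, which this hunk does not show. The sketch below contrasts the two cases using only `java.util`; the `String` keys and values are hypothetical stand-ins for `TypeInfo` and the expression nodes, and the list-valued map only imitates a multimap:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GuardedPutSketch {
  // Plain Map: the unconditional put replaces whatever the guarded put stored,
  // so the guard has no visible effect.
  static Map<String, String> withPlainMap(String type, String column, String value) {
    Map<String, String> expressions = new HashMap<>();
    if (!expressions.containsKey(type)) {
      expressions.put(type, column);
    }
    expressions.put(type, value);
    return expressions;
  }

  // Multimap-like container: each put appends, so the guard seeds the
  // per-type list with the column exactly once.
  static Map<String, List<String>> withMultimap(String type, String column, String value) {
    Map<String, List<String>> expressions = new HashMap<>();
    if (!expressions.containsKey(type)) {
      expressions.computeIfAbsent(type, k -> new ArrayList<>()).add(column);
    }
    expressions.computeIfAbsent(type, k -> new ArrayList<>()).add(value);
    return expressions;
  }

  public static void main(String[] args) {
    System.out.println(withPlainMap("int", "int_col", "1")); // {int=1}
    System.out.println(withMultimap("int", "int_col", "1")); // {int=[int_col, 1]}
  }
}
```

If `expressions` turns out to be a multimap, the guard is meaningful and the comment above does not apply.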

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));

Review comment:
       For `IN`, the original logic is valid as long as it is evaluated in 
`UnknownAs.FALSE` mode, but for `NOT IN` the correct interpretation would be 
`UnknownAs.TRUE`.
   
   I think we might be better off not dealing with the `UnknownAs` devils here 
and retaining the original expressions, as the currently proposed patch does. 
I'm not sure how many optimization opportunities, or how much performance, we 
will lose that way.
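
To illustrate why the mode matters, here is a minimal sketch of SQL three-valued logic in plain Java. It is not Hive code: `sqlIn`, `sqlNot`, and `passesFilter` are hypothetical helpers, a `null` `Boolean` stands for UNKNOWN, and the filter is assumed to keep a row only when the predicate is TRUE:

```java
public class NotInNullSketch {
  // x = y under SQL semantics: UNKNOWN (null) if either side is NULL.
  static Boolean sqlEquals(Double x, Double y) {
    if (x == null || y == null) {
      return null;
    }
    return x.doubleValue() == y.doubleValue();
  }

  // x IN (values): TRUE on any match, UNKNOWN if no match but a NULL was seen,
  // FALSE otherwise.
  static Boolean sqlIn(Double x, Double... values) {
    boolean sawUnknown = false;
    for (Double v : values) {
      Boolean eq = sqlEquals(x, v);
      if (eq == null) {
        sawUnknown = true;
      } else if (eq) {
        return Boolean.TRUE;
      }
    }
    return sawUnknown ? null : Boolean.FALSE;
  }

  // Standard SQL NOT: UNKNOWN stays UNKNOWN.
  static Boolean sqlNot(Boolean b) {
    if (b == null) {
      return null;
    }
    return !b;
  }

  // A WHERE clause keeps the row only when the predicate is TRUE.
  static boolean passesFilter(Boolean predicate) {
    return Boolean.TRUE.equals(predicate);
  }

  public static void main(String[] args) {
    Double intCol = 355.0;
    // IN: replacing 355.8 with NULL does not change the filter outcome.
    System.out.println(passesFilter(sqlIn(intCol, 355.8)));                 // false
    System.out.println(passesFilter(sqlIn(intCol, (Double) null)));         // false
    // NOT IN: the same replacement flips the outcome.
    System.out.println(passesFilter(sqlNot(sqlIn(intCol, 355.8))));         // true
    System.out.println(passesFilter(sqlNot(sqlIn(intCol, (Double) null)))); // false
  }
}
```

The last two lines correspond to the reported symptom: once the literal is folded to null, `NOT IN` evaluates to UNKNOWN and the row is dropped.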
   
   

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       Another idea would be to keep doing these optimizations, but only when 
we are not below a `NOT` operator.
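
A rough sketch of that check, assuming `NOT IN` is represented as a `NOT` node over an `IN` node. The `Node` class and `countRewritableIns` are hypothetical and not part of the Hive AST API; the walk conservatively treats any `NOT` ancestor as disabling the rewrite:

```java
import java.util.Arrays;
import java.util.List;

public class BelowNotSketch {
  // Hypothetical, minimal expression node: an operator name plus children.
  static final class Node {
    final String op;
    final List<Node> children;

    Node(String op, Node... children) {
      this.op = op;
      this.children = Arrays.asList(children);
    }
  }

  // Counts the IN nodes that have no NOT ancestor, i.e. the ones where the
  // null-folding rewrite would still be allowed under this rule.
  static int countRewritableIns(Node node, boolean belowNot) {
    int count = ("IN".equals(node.op) && !belowNot) ? 1 : 0;
    boolean childBelowNot = belowNot || "NOT".equals(node.op);
    for (Node child : node.children) {
      count += countRewritableIns(child, childBelowNot);
    }
    return count;
  }

  public static void main(String[] args) {
    // a IN (...) AND NOT (b IN (...))
    Node filter = new Node("AND", new Node("IN"), new Node("NOT", new Node("IN")));
    System.out.println(countRewritableIns(filter, false)); // 1
  }
}
```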

##########
File path: ql/src/test/queries/clientpositive/in_coercion.q
##########
@@ -0,0 +1,14 @@
+DROP TABLE src_table;
+CREATE TABLE src_table (key int);
+LOAD DATA LOCAL INPATH '../../data/files/kv6.txt' INTO TABLE src_table;
+
+-- verify table has data
+select count(*) from src_table; 

Review comment:
       Note: we could use the `assert_true` UDF here, so that no one can 
silently overwrite the q.out:
   ```
   select assert_true(count(*) = 100) from src_table
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 559871)
    Time Spent: 0.5h  (was: 20m)

> "not in" clause returns incorrect data when there is coercion
> -------------------------------------------------------------
>
>                 Key: HIVE-24817
>                 URL: https://issues.apache.org/jira/browse/HIVE-24817
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Steve Carlin
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When a query's where clause checks an integer column with "not in" against 
> a decimal value, the decimal value is changed to null, causing incorrect 
> results.
> This is a sample query of a failure:
> select count(*) from my_tbl where int_col not in (355.8);
> Since int_col can never be 355.8, one would expect all the rows to be 
> returned, but the 355.8 is changed into a null value, so no rows are 
> returned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
