Mahesh Raju Somalaraju created HIVE-28503:
---------------------------------------------

             Summary: Wrong results(NULL) when string concat operation with || 
operator for ORC file format when vectorization enabled
                 Key: HIVE-28503
                 URL: https://issues.apache.org/jira/browse/HIVE-28503
             Project: Hive
          Issue Type: Bug
          Components: Hive
            Reporter: Mahesh Raju Somalaraju
            Assignee: Mahesh Raju Somalaraju


Wrong results(NULL) when string concat operation with || operator for ORC file 
format when vectorization enabled.

set hive.query.results.cache.enabled=false;

set hive.fetch.task.conversion=none;

set hive.vectorized.execution.enabled=true;

Result is NULL when we do concat operation with || operator. Locally it is not 
able to reproduce. It is able to reproduce in cluster with more records.Input 
data should be mix of NULL and NOT NULL values something like below.

create a table with orc file format and has 3 string columns and insert data 
such way that it should have mix of NULL values and NOT NULL values.

 
|column1|column2|column3|count|
|NULL      |NULL     |NULL      |18000  |
|G         |L        |A1        |123932 |

with above configuration, perform concat() operation with || operator and 
insert new row with the concat() results.

select * from (select t1.column1, t1.column2, t1.column3,  *t1.column1 || 
t1.column2 || t1.column3 as VEH_MODEL_ID*

from test_table t1 )t where VEH_MODEL_ID is NULL and if(column1 is null,0,1)=1 
AND if(column2 is null,0,1)=1 AND if(column3 is null,0,1)=1 limit 1;

in above query, *t1.column1 || t1.column2 || t1.column3 as VEH_MODEL_ID*  
operation is returning the NULL result eventhough the input string values are 
not null.
|t.VEH_MODEL_ID|t.column1|t.column2|t.column3|
|NULL|G|L|A2|

 

+Proposed solution as per code review:+

+*Root cause:*+

While doing concat() operation, In *StringGroupConcatColCol* class, if input 
batch vector has mixed of NULL and NOT NULL values of inputs then we are not 
setting output vector batch flags related to NULL and NOT NULLS correctly . 
Each value in the vector has the flag whether it is NULL or NOT NULL. But  here 
we are not setting correctly the whole output vector flag (outV.noNulls).  
Without this flag it is working for parquet, some how they may be referring 
each value instead of checking whole output vector flag whether it is NULL or 
NOT NULL.

+*code snippet:*+

*StringGroupConcatColCol->evaluate() method:*

if (inV1.noNulls && !inV2.noNulls) {   *>>  if any one input has NULL, then 
output should be NULL.* 

outV.noNulls = false; *--> setting this flag false as all values in this are 
NULLs*

----

}

else if (!inV1.noNulls && inV2.noNulls) { *>>  if any one input has NULL, then 
output should be NULL.*

outV.noNulls = false; --> *setting this flag false as all values in this are 
NULLs* 

---

}

else if (!inV1.noNulls && !inV2.noNulls) {  *>>  if two inputs are NULL, then 
output should be NULL.*

outV.noNulls = false; *--> setting this flag false as all values in this are 
NULLs**

---

}

else {                  *--> there are no nulls in either input vector*

{color:#4c9aff}*outV.noNulls = true;  --> this has to be set true, as there are 
no NULL values, this check is missed currently.*{color}
// perform data operation

---

}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to