When you turn "vectorized" on, does the following query consistently return 10000 in the output?

select ten_thousand() from testTabOrc

Yong
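In case it is useful, the plan itself also shows whether vectorization was actually applied. A quick check along these lines (a sketch only, reusing the table, function, and setting names from the post below) would be:

set hive.vectorized.execution.enabled=true;

-- With vectorization on, this should print 10000 once per row in testTabOrc.
select ten_thousand() from testTabOrc;

-- The EXPLAIN output for vectorized map work includes a line such as
-- "Execution mode: vectorized"; if it is absent, the query fell back to
-- row-mode execution despite the setting.
explain select first from testTabOrc
where first between ten_thousand()-10000 and ten_thousand()-9995;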
Date: Fri, 30 May 2014 08:24:43 -0400
Subject: Vectorization with UDFs returns incorrect results
From: bbowman...@gmail.com
To: user@hive.apache.org

Hive 0.13 & Hadoop 2.4

I am having an issue when using the combination of vectorized query execution, BETWEEN, and a custom UDF. When I have vectorization on, my query returns an empty set. When I then turn vectorization off, the same query returns the correct results.

Example query:

SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) and UDF_1

My UDFs seem to work for everything else except this specific circumstance. Is this an issue in the Hive software, or am I writing my UDFs in such a way that they do not work with vectorization? If the latter, what is the correct way?

I created a test scenario where I was able to reproduce the problem I am seeing:

TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 10000):

package com.test;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;

public class tenThousand extends UDF {
    // Reused output object, as is conventional for Hive UDFs.
    private final LongWritable result = new LongWritable();

    public LongWritable evaluate() {
        result.set(10000);
        return result;
    }
}

TEST DATA (test.input):

1|CBCABC|12
2|DBCABC|13
3|EBCABC|14
40000|ABCABC|15
50000|BBCABC|16
60000|CBCABC|17

CREATING ORC TABLE:

0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, second varchar(20), third int) partitioned by (range int) clustered by (first) sorted by (first) into 8 buckets stored as orc tblproperties ("orc.compress" = "SNAPPY", "orc.index" = "true");

CREATE LOADING TABLE:

0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint, second varchar(20), third int) partitioned by (range int) row format delimited fields terminated by '|' stored as textfile;

COPY IN DATA:

[root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/.

ORC DATA:

[root@server]# beeline -u jdbc:hive2://server:10002/db -n root --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) select * from loadingDir;"

LOAD TEST FUNCTION:

0: jdbc:hive2://server:10002/db> add jar /opt/hadoop/lib/testFunction.jar
0: jdbc:hive2://server:10002/db> create temporary function ten_thousand as 'com.test.tenThousand';

TURN OFF VECTORIZATION:

0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=false;

QUERY (RESULTS AS EXPECTED):

0: jdbc:hive2://server:10002/db> select first from testTabOrc where first between ten_thousand()-10000 and ten_thousand()-9995;
+--------+
| first  |
+--------+
| 1      |
| 2      |
| 3      |
+--------+
3 rows selected (15.286 seconds)

TURN ON VECTORIZATION:

0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=true;

QUERY AGAIN (WRONG RESULTS):

0: jdbc:hive2://server:10002/db> select first from testTabOrc where first between ten_thousand()-10000 and ten_thousand()-9995;
+--------+
| first  |
+--------+
+--------+
No rows selected (17.763 seconds)
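As a footnote to the reproduction above, two variations may help isolate whether the BETWEEN expression, rather than the UDF itself, is what misbehaves under vectorization. This is a sketch only; it reuses the table and function defined above and assumes nothing beyond them:

set hive.vectorized.execution.enabled=true;

-- Same range expressed as literals; with the test data above this should
-- return rows 1, 2, and 3 whether vectorization is on or off.
select first from testTabOrc where first between 0 and 5;

-- Same predicate rewritten without BETWEEN, still calling the UDF, to see
-- whether the UDF-plus-BETWEEN combination is the failing case.
select first from testTabOrc
where first >= ten_thousand() - 10000 and first <= ten_thousand() - 9995;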