When you turn "vectorized" on, does the following query consistently return 10000 in the output?

select ten_thousand() from testTabOrc

Yong
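In case it is useful, the plan itself also shows whether vectorization was actually applied. A quick check along these lines (a sketch only, reusing the table, function, and setting names from the post below) would be:

set hive.vectorized.execution.enabled=true;

-- With vectorization on, this should print 10000 once per row in testTabOrc.
select ten_thousand() from testTabOrc;

-- The EXPLAIN output for vectorized map work includes a line such as
-- "Execution mode: vectorized"; if it is absent, the query fell back to
-- row-mode execution despite the setting.
explain select first from testTabOrc
where first between ten_thousand()-10000 and ten_thousand()-9995;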
Date: Fri, 30 May 2014 08:24:43 -0400
Subject: Vectorization with UDFs returns incorrect results
From: bbowman...@gmail.com
To: user@hive.apache.org

Hive 0.13 & Hadoop 2.4

I am having an issue when using the combination of vectorized query execution, BETWEEN, and a custom UDF. When I have vectorization on, my query returns an empty set. When I then turn vectorization off, the same query returns the correct results.

Example query:

SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) and UDF_1

My UDFs seem to work for everything else except this specific circumstance. Is this an issue in the Hive software, or am I writing my UDFs in such a way that they do not work with vectorization? If the latter, what is the correct way?

I created a test scenario where I was able to reproduce the problem I am seeing:

TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 10000):

package com.test;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;

public class tenThousand extends UDF {
    // Reused output object, as is conventional for Hive UDFs.
    private final LongWritable result = new LongWritable();

    public LongWritable evaluate() {
        result.set(10000);
        return result;
    }
}

TEST DATA (test.input):

1|CBCABC|12
2|DBCABC|13
3|EBCABC|14
40000|ABCABC|15
50000|BBCABC|16
60000|CBCABC|17

CREATING ORC TABLE:

0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, second varchar(20), third int) partitioned by (range int) clustered by (first) sorted by (first) into 8 buckets stored as orc tblproperties ("orc.compress" = "SNAPPY", "orc.index" = "true");

CREATE LOADING TABLE:

0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint, second varchar(20), third int) partitioned by (range int) row format delimited fields terminated by '|' stored as textfile;

COPY IN DATA:

[root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/.

ORC DATA:

[root@server]# beeline -u jdbc:hive2://server:10002/db -n root --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) select * from loadingDir;"

LOAD TEST FUNCTION:

0: jdbc:hive2://server:10002/db> add jar /opt/hadoop/lib/testFunction.jar
0: jdbc:hive2://server:10002/db> create temporary function ten_thousand as 'com.test.tenThousand';

TURN OFF VECTORIZATION:

0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=false;

QUERY (RESULTS AS EXPECTED):

0: jdbc:hive2://server:10002/db> select first from testTabOrc where first between ten_thousand()-10000 and ten_thousand()-9995;
+--------+
| first  |
+--------+
| 1      |
| 2      |
| 3      |
+--------+
3 rows selected (15.286 seconds)

TURN ON VECTORIZATION:

0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=true;

QUERY AGAIN (WRONG RESULTS):

0: jdbc:hive2://server:10002/db> select first from testTabOrc where first between ten_thousand()-10000 and ten_thousand()-9995;
+--------+
| first  |
+--------+
+--------+
No rows selected (17.763 seconds)
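As a footnote to the reproduction above, two variations may help isolate whether the BETWEEN expression, rather than the UDF itself, is what misbehaves under vectorization. This is a sketch only; it reuses the table and function defined above and assumes nothing beyond them:

set hive.vectorized.execution.enabled=true;

-- Same range expressed as literals; with the test data above this should
-- return rows 1, 2, and 3 whether vectorization is on or off.
select first from testTabOrc where first between 0 and 5;

-- Same predicate rewritten without BETWEEN, still calling the UDF, to see
-- whether the UDF-plus-BETWEEN combination is the failing case.
select first from testTabOrc
where first >= ten_thousand() - 10000 and first <= ten_thousand() - 9995;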