Vectorization with UDFs returns incorrect results

Benjamin Bowman Fri, 30 May 2014 05:25:39 -0700

Hive 0.13 & Hadoop 2.4

I am having an issue when using the combination of vectorized query
execution, BETWEEN, and a custom UDF.  When I have vectorization on, my
query returns an empty set.  When I then turn vectorization off, my query
returns the correct results.


Example Query:  SELECT column_1 FROM table_1
WHERE column_1 BETWEEN (UDF_1 - X) AND UDF_1

My UDFs seem to work in every case except this one.  Is this an issue in
Hive, or am I writing my UDFs in a way that does not work with
vectorization?  If the latter, what is the correct way?

I created a test scenario that reproduces the problem:

*TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 10000):*
package com.test;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;

// Registered below as ten_thousand(); takes no arguments and always returns 10000.
public class tenThousand extends UDF {

  // Reused across calls so that evaluate() does not allocate a new object per row.
  private final LongWritable result = new LongWritable();

  public LongWritable evaluate() {
    result.set(10000);
    return result;
  }
}

*TEST DATA (test.input):*
1|CBCABC|12
2|DBCABC|13
3|EBCABC|14
40000|ABCABC|15
50000|BBCABC|16
60000|CBCABC|17

*CREATING ORC TABLE:*
0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint,
second varchar(20), third int) partitioned by (range int) clustered by
(first) sorted by (first) into 8 buckets stored as orc tblproperties
("orc.compress" = "SNAPPY", "orc.index" = "true");

*CREATE LOADING TABLE:*
0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint,
second varchar(20), third int) partitioned by (range int) row format
delimited fields terminated by '|' stored as textfile;

*COPY IN DATA:*
[root@server]#  hadoop fs -copyFromLocal /tmp/test.input /db/loading/.

*LOAD ORC DATA:*
[root@server]#  beeline -u jdbc:hive2://server:10002/db -n root --hiveconf
hive.exec.dynamic.partition.mode=nonstrict --hiveconf
hive.enforce.sorting=true -e "insert into table testTabOrc partition(range)
select * from loadingDir;"

*LOAD TEST FUNCTION:*
0: jdbc:hive2://server:10002/db>  add jar /opt/hadoop/lib/testFunction.jar
0: jdbc:hive2://server:10002/db>  create temporary function ten_thousand as
'com.test.tenThousand';

*TURN OFF VECTORIZATION:*
0: jdbc:hive2://server:10002/db>  set
hive.vectorized.execution.enabled=false;

*QUERY (RESULTS AS EXPECTED):*
0: jdbc:hive2://server:10002/db> select first from testTabOrc where first
between ten_thousand()-10000 and ten_thousand()-9995;
+--------+
| first  |
+--------+
| 1      |
| 2      |
| 3      |
+--------+
3 rows selected (15.286 seconds)

*TURN ON VECTORIZATION:*
0: jdbc:hive2://server:10002/db>  set
hive.vectorized.execution.enabled=true;

*QUERY AGAIN (WRONG RESULTS):*
0: jdbc:hive2://server:10002/db> select first from testTabOrc where first
between ten_thousand()-10000 and ten_thousand()-9995;
+--------+
| first  |
+--------+
+--------+
No rows selected (17.763 seconds)
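
For reference, the result I expect can be checked outside Hive.  The sketch
below is plain Java, not Hive code: it hard-codes the "first" column from
test.input, uses a stand-in method for ten_thousand(), and assumes only that
BETWEEN is inclusive on both ends.  The class and method names (BetweenCheck,
filterBetween) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BetweenCheck {

  // Stand-in for the ten_thousand() UDF: always returns 10000.
  static long tenThousand() {
    return 10000L;
  }

  // Keeps every value v with lower <= v <= upper (inclusive on both ends).
  static List<Long> filterBetween(long[] values, long lower, long upper) {
    List<Long> kept = new ArrayList<>();
    for (long v : values) {
      if (v >= lower && v <= upper) {
        kept.add(v);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Values of the "first" column from test.input.
    long[] first = {1L, 2L, 3L, 40000L, 50000L, 60000L};
    long lower = tenThousand() - 10000; // 0
    long upper = tenThousand() - 9995;  // 5
    System.out.println(filterBetween(first, lower, upper)); // prints [1, 2, 3]
  }
}
```

The bounds evaluate to 0 and 5, so rows 1, 2 and 3 should be returned, which
matches the output I get with vectorization turned off.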
