Nicholas Brenwald created HIVE-12678:
----------------------------------------

             Summary: BETWEEN relational operator sometimes returns incorrect 
results against PARQUET tables
                 Key: HIVE-12678
                 URL: https://issues.apache.org/jira/browse/HIVE-12678
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.1.0
            Reporter: Nicholas Brenwald


When querying a parquet table, the BETWEEN relational operator returns 
incorrect results when hive.optimize.index.filter and hive.optimize.ppd.storage 
are enabled

Create a parquet table:
{code}
create table t(c string) stored as parquet;
{code}

Insert some strings representing dates
{code}
insert into t select '2015-12-09' from default.dual limit 1;
insert into t select '2015-12-10' from default.dual limit 1;
insert into t select '2015-12-11' from default.dual limit 1;
{code}

h3. Example 1
This query correctly returns 3:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=true;
select count(*) from t where c >= '2015-12-09' and c <= '2015-12-11';
+------+--+
| _c0  |
+------+--+
| 3    |
+------+--+
{code}

This query incorrectly returns 1:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=true;
select count(*) from t where c between '2015-12-09' and '2015-12-11';
+------+--+
| _c0  |
+------+--+
| 1    |
+------+--+
{code}

Disabling hive.optimize.findex.filter resolves the problem. This query now 
correctly returns 3:
{code}
set hive.optimize.index.filter=false;
set hive.optimize.ppd.storage=true;
select count(*) from t where c between '2015-12-09' and '2015-12-11';
+------+--+
| _c0  |
+------+--+
| 3    |
+------+--+
{code}

Disabling hive.optimize.ppd.storage resolves the problem. This query now 
correctly returns 3:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=false;
select count(*) from t where c between '2015-12-09' and '2015-12-11';
+------+--+
| _c0  |
+------+--+
| 3    |
+------+--+
{code}

h3. Example 2
This query correctly returns 1:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=true;
select count(*) from t where c >=  '2015-12-10' and c <= '2015-12-10';
+------+--+
| _c0  |
+------+--+
| 1    |
+------+--+
{code}

This query incorrectly returns 0:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=true;
select count(*) from t where c between '2015-12-10' and '2015-12-10';
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+
{code}

Disabling hive.optimize.findex.filter resolves the problem. This query now 
correctly returns 1:
{code}
set hive.optimize.index.filter=false;
set hive.optimize.ppd.storage=true;
select count(*) from t where c >= '2015-12-10' and c <= '2015-12-10';
+------+--+
| _c0  |
+------+--+
| 1    |
+------+--+
{code}

Disabling hive.optimize.ppd.storage resolves the problem. This query now 
correctly returns 1:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=false;
select count(*) from t where c >= '2015-12-10' and c <= '2015-12-10';
+------+--+
| _c0  |
+------+--+
| 1    |
+------+--+
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to