[jira] [Updated] (HIVE-22500) sort by and timestamp casting filters out too many rows

Guillaume (Jira) Fri, 15 Nov 2019 01:42:46 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-22500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Guillaume updated HIVE-22500:
-----------------------------
    Description: 
Consider this example.

Preparation: 
{code:java}
create temporary table opens as (
  select stack(1, 
         1 , cast('2019-11-13 08:07:28' as timestamp)
   ) as (id , load_ts )
);
{code}
 

Queries: This is just about counting the number of rows, with filters always 
matching, and possibly sort by. 1 is always expected.
{code:java}
select count(*) from ( select * from opens) t;
select count(*) from ( select * from opens sort by id) t;
select count(*) from ( select * from opens where load_ts >= '2019-11-13 
08:07:00' ) t;
select count(*) from ( select * from opens where load_ts >= '2019-11-13 
08:07:00' sort by id) t;
select count(*) from ( select * from opens where load_ts <= '2019-11-13 
09:07:00') t;
select count(*) from ( select * from opens where load_ts <= '2019-11-13 
09:07:00' sort by id) t;
{code}
 

The latest query (_sort by_ and _<=_ on timestamp) returns 0 rows.

I believe that this is the cause of other issues I have, where I have missing 
rows in queries with the timestamp (but not the explicit sort by).

Note that if instead of a temporary table I use a CTE for opens, the issue does 
not appear.

I tried workarounds (inverse order of operands, adding _not_ or _not not_ ) to 
no avail.

One thing that did work is to explicitly cast the string to a timestamp:
{code:java}
select count(*) from ( select * from opens where load_ts <= cast('2019-11-13 
09:07:00' as timestamp) sort by id) t;{code}
It might be good practice indeed, but there still is a discrepancy between how 
_>=_ and _<=_ are handled, or how _sort by_ works.

Note: this is on Hive from 
[hdp3.1.4|[https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/release-notes/content/patch_hive.html]],
 without llap.

  was:
Consider this example.

Preparation:

 
{code:java}
create temporary table opens as (
  select stack(1, 
         1 , cast('2019-11-13 08:07:28' as timestamp)
   ) as (id , load_ts )
);
{code}
Queries. This is just about counting the number of rows, with filters always 
matching, and possibly sort by. 1 is always expected.

 
{code:java}
select count(*) from ( select * from opens) t;
select count(*) from ( select * from opens sort by id) t;
select count(*) from ( select * from opens where load_ts >= '2019-11-13 
08:07:00' ) t;
select count(*) from ( select * from opens where load_ts >= '2019-11-13 
08:07:00' sort by id) t;
select count(*) from ( select * from opens where load_ts <= '2019-11-13 
09:07:00') t;
select count(*) from ( select * from opens where load_ts <= '2019-11-13 
09:07:00' sort by id) t;
{code}
 

The latest query (_sort by_ and _<=_ on timestamp) returns 0 rows.

I believe that this is the cause of other issues I have, where I have missing 
rows in queries with the timestamp (but not the explicit sort by).

Note that if instead of a temporary table I use a CTE for opens, the issue does 
not appear.

I tried workarounds (inverse order of operands, adding _not_ or _not not_ ) to 
no avail.

One thing that did work is to explicitly cast the string to a timestamp:
{code:java}
select count(*) from ( select * from opens where load_ts <= cast('2019-11-13 
09:07:00' as timestamp) sort by id) t;{code}
It might be good practice indeed, but there still is a discrepancy between how 
_>=_ and _<=_ are handled, or how _sort by_ works.

Note: this is on Hive from 
[hdp3.1.4|[https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/release-notes/content/patch_hive.html]],
 without llap.


> sort by and timestamp casting filters out too many rows
> -------------------------------------------------------
>
>                 Key: HIVE-22500
>                 URL: https://issues.apache.org/jira/browse/HIVE-22500
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 3.1.0
>            Reporter: Guillaume
>            Priority: Major
>
> Consider this example.
> Preparation: 
> {code:java}
> create temporary table opens as (
>   select stack(1, 
>          1 , cast('2019-11-13 08:07:28' as timestamp)
>    ) as (id , load_ts )
> );
> {code}
>  
> Queries: This is just about counting the number of rows, with filters always 
> matching, and possibly sort by. 1 is always expected.
> {code:java}
> select count(*) from ( select * from opens) t;
> select count(*) from ( select * from opens sort by id) t;
> select count(*) from ( select * from opens where load_ts >= '2019-11-13 
> 08:07:00' ) t;
> select count(*) from ( select * from opens where load_ts >= '2019-11-13 
> 08:07:00' sort by id) t;
> select count(*) from ( select * from opens where load_ts <= '2019-11-13 
> 09:07:00') t;
> select count(*) from ( select * from opens where load_ts <= '2019-11-13 
> 09:07:00' sort by id) t;
> {code}
>  
> The latest query (_sort by_ and _<=_ on timestamp) returns 0 rows.
> I believe that this is the cause of other issues I have, where I have missing 
> rows in queries with the timestamp (but not the explicit sort by).
> Note that if instead of a temporary table I use a CTE for opens, the issue 
> does not appear.
> I tried workarounds (inverse order of operands, adding _not_ or _not not_ ) 
> to no avail.
> One thing that did work is to explicitly cast the string to a timestamp:
> {code:java}
> select count(*) from ( select * from opens where load_ts <= cast('2019-11-13 
> 09:07:00' as timestamp) sort by id) t;{code}
> It might be good practice indeed, but there still is a discrepancy between 
> how _>=_ and _<=_ are handled, or how _sort by_ works.
> Note: this is on Hive from 
> [hdp3.1.4|[https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/release-notes/content/patch_hive.html]],
>  without llap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (HIVE-22500) sort by and timestamp casting filters out too many rows

Reply via email to