[ https://issues.apache.org/jira/browse/HIVE-22500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guillaume updated HIVE-22500: ----------------------------- Description: Consider this example. Preparation: {code:java} create temporary table opens as ( select stack(1, 1 , cast('2019-11-13 08:07:28' as timestamp) ) as (id , load_ts ) ); {code} Queries: This is just about counting the number of rows, with filters always matching, and possibly sort by. 1 is always expected. {code:java} select count(*) from ( select * from opens) t; select count(*) from ( select * from opens sort by id) t; select count(*) from ( select * from opens where load_ts >= '2019-11-13 08:07:00' ) t; select count(*) from ( select * from opens where load_ts >= '2019-11-13 08:07:00' sort by id) t; select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00') t; select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00' sort by id) t; {code} The latest query (_sort by_ and _<=_ on timestamp) returns 0 rows. I believe that this is the cause of other issues I have, where I have missing rows in queries with the timestamp (but not the explicit sort by). Note that if instead of a temporary table I use a CTE for opens, the issue does not appear. I tried workarounds (inverse order of operands, adding _not_ or _not not_ ) to no avail. One thing that did work is to explicitly cast the string to a timestamp: {code:java} select count(*) from ( select * from opens where load_ts <= cast('2019-11-13 09:07:00' as timestamp) sort by id) t;{code} It might be good practice indeed, but there still is a discrepancy between how _>=_ and _<=_ are handled, or how _sort by_ works. Note: this is on Hive from [hdp3.1.4|[https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/release-notes/content/patch_hive.html]], without llap. was: Consider this example. Preparation: {code:java} create temporary table opens as ( select stack(1, 1 , cast('2019-11-13 08:07:28' as timestamp) ) as (id , load_ts ) ); {code} Queries. This is just about counting the number of rows, with filters always matching, and possibly sort by. 1 is always expected. {code:java} select count(*) from ( select * from opens) t; select count(*) from ( select * from opens sort by id) t; select count(*) from ( select * from opens where load_ts >= '2019-11-13 08:07:00' ) t; select count(*) from ( select * from opens where load_ts >= '2019-11-13 08:07:00' sort by id) t; select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00') t; select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00' sort by id) t; {code} The latest query (_sort by_ and _<=_ on timestamp) returns 0 rows. I believe that this is the cause of other issues I have, where I have missing rows in queries with the timestamp (but not the explicit sort by). Note that if instead of a temporary table I use a CTE for opens, the issue does not appear. I tried workarounds (inverse order of operands, adding _not_ or _not not_ ) to no avail. One thing that did work is to explicitly cast the string to a timestamp: {code:java} select count(*) from ( select * from opens where load_ts <= cast('2019-11-13 09:07:00' as timestamp) sort by id) t;{code} It might be good practice indeed, but there still is a discrepancy between how _>=_ and _<=_ are handled, or how _sort by_ works. Note: this is on Hive from [hdp3.1.4|[https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/release-notes/content/patch_hive.html]], without llap. > sort by and timestamp casting filters out too many rows > ------------------------------------------------------- > > Key: HIVE-22500 > URL: https://issues.apache.org/jira/browse/HIVE-22500 > Project: Hive > Issue Type: Bug > Components: Hive > Affects Versions: 3.1.0 > Reporter: Guillaume > Priority: Major > > Consider this example. > Preparation: > {code:java} > create temporary table opens as ( > select stack(1, > 1 , cast('2019-11-13 08:07:28' as timestamp) > ) as (id , load_ts ) > ); > {code} > > Queries: This is just about counting the number of rows, with filters always > matching, and possibly sort by. 1 is always expected. > {code:java} > select count(*) from ( select * from opens) t; > select count(*) from ( select * from opens sort by id) t; > select count(*) from ( select * from opens where load_ts >= '2019-11-13 > 08:07:00' ) t; > select count(*) from ( select * from opens where load_ts >= '2019-11-13 > 08:07:00' sort by id) t; > select count(*) from ( select * from opens where load_ts <= '2019-11-13 > 09:07:00') t; > select count(*) from ( select * from opens where load_ts <= '2019-11-13 > 09:07:00' sort by id) t; > {code} > > The latest query (_sort by_ and _<=_ on timestamp) returns 0 rows. > I believe that this is the cause of other issues I have, where I have missing > rows in queries with the timestamp (but not the explicit sort by). > Note that if instead of a temporary table I use a CTE for opens, the issue > does not appear. > I tried workarounds (inverse order of operands, adding _not_ or _not not_ ) > to no avail. > One thing that did work is to explicitly cast the string to a timestamp: > {code:java} > select count(*) from ( select * from opens where load_ts <= cast('2019-11-13 > 09:07:00' as timestamp) sort by id) t;{code} > It might be good practice indeed, but there still is a discrepancy between > how _>=_ and _<=_ are handled, or how _sort by_ works. > Note: this is on Hive from > [hdp3.1.4|[https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/release-notes/content/patch_hive.html]], > without llap. -- This message was sent by Atlassian Jira (v8.3.4#803005)