[ https://issues.apache.org/jira/browse/HIVE-26270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stamatis Zampetakis updated HIVE-26270:
---------------------------------------
    Labels: compatibility timestamp  (was: )

> Wrong timestamps when reading Hive 3.1.x Parquet files with vectorized reader
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-26270
>                 URL: https://issues.apache.org/jira/browse/HIVE-26270
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2, Parquet
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: compatibility, timestamp
>
> The Parquet files are written with Hive 3.1.x onwards, with the timezone set to US/Pacific.
> {code:sql}
> CREATE TABLE employee (eid INT, birth timestamp) STORED AS PARQUET;
> INSERT INTO employee VALUES
> (1, '1880-01-01 00:00:00'),
> (2, '1884-01-01 00:00:00'),
> (3, '1990-01-01 00:00:00');
> {code}
> The Parquet files are read with Hive 4.0.0-alpha-1 onwards.
> +Without vectorization+ the results are correct.
> {code:sql}
> SELECT * FROM employee;
> {code}
> {noformat}
> 1 1880-01-01 00:00:00
> 2 1884-01-01 00:00:00
> 3 1990-01-01 00:00:00
> {noformat}
> +With vectorization+ some timestamps are shifted.
> {code:sql}
> -- Disable fetch task conversion to force vectorization to kick in
> set hive.fetch.task.conversion=none;
> SELECT * FROM employee;
> {code}
> {noformat}
> 1 1879-12-31 23:52:58
> 2 1884-01-01 00:00:00
> 3 1990-01-01 00:00:00
> {noformat}
> The problem is the same as the one reported under HIVE-24074. The data were
> written using the new Date/Time APIs (java.time) in Hive 3.1.3 and here they
> were read using the old APIs (java.sql).
> The difference from HIVE-24074 is that here the problem appears only for
> vectorized execution, while the non-vectorized reader works fine, so there
> is an *inconsistency in the behavior* of the vectorized and non-vectorized
> readers.
> The non-vectorized reader works fine because it automatically infers that it
> should use the new JDK APIs to read back the timestamp value. This is
> possible in this case because there is metadata in the file (i.e.,
> the presence of {{writer.time.zone}}) from which it can infer that the
> timestamps were written using the new Date/Time APIs.
> The inconsistent behavior between the vectorized and non-vectorized readers
> is a regression caused by HIVE-25104. This JIRA is an attempt to re-align the
> behavior of the vectorized and non-vectorized readers.
> Note that if the file metadata are empty, neither the vectorized nor the
> non-vectorized reader can determine which APIs to use for the conversion. In
> this case the user must set
> {{hive.parquet.timestamp.legacy.conversion.enabled}} explicitly to get back
> the correct results.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
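
The 7 minute 2 second shift on the 1880 row can be reproduced with java.time alone. The following is a minimal sketch, not Hive's reader code. It assumes the writer converted the wall-clock value using the full tz database rules for US/Pacific (which use a local mean time offset before standard time was adopted in 1883), and it models the legacy read path as a fixed -08:00 offset. The class name TimestampShiftDemo and the method readWithFixedOffset are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical demo class; not part of Hive.
public class TimestampShiftDemo {

    // "Writer" side: turn the wall-clock value into an instant using the zone's
    // full java.time rules. "Reader" side (assumption): render that instant with
    // a fixed offset, standing in for a legacy conversion that ignores the
    // historical local mean time offset.
    public static LocalDateTime readWithFixedOffset(LocalDateTime written,
                                                    ZoneId writerZone,
                                                    ZoneOffset readerOffset) {
        Instant stored = written.atZone(writerZone).toInstant();
        return LocalDateTime.ofInstant(stored, readerOffset);
    }

    public static void main(String[] args) {
        ZoneId pacific = ZoneId.of("US/Pacific");
        ZoneOffset fixed = ZoneOffset.ofHours(-8);
        for (int year : new int[] {1880, 1884, 1990}) {
            LocalDateTime written = LocalDateTime.of(year, 1, 1, 0, 0, 0);
            // Offset that java.time applies for this wall-clock value.
            ZoneOffset writerOffset = pacific.getRules().getOffset(written);
            System.out.println(written
                + " writerOffset=" + writerOffset
                + " readBack=" + readWithFixedOffset(written, pacific, fixed));
        }
    }
}
```

Under these assumptions only the 1880 value moves (to 1879-12-31 23:52:58), while the 1884 and 1990 values round-trip unchanged, which matches the vectorized output quoted in the issue.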