[jira] [Updated] (HIVE-26270) Wrong timestamps when reading Hive 3.1.x Parquet files with vectorized reader

Stamatis Zampetakis (Jira) Fri, 27 May 2022 07:05:08 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-26270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Stamatis Zampetakis updated HIVE-26270:
---------------------------------------
    Labels: compatibility timestamp  (was: )

> Wrong timestamps when reading Hive 3.1.x Parquet files with vectorized reader
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-26270
>                 URL: https://issues.apache.org/jira/browse/HIVE-26270
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2, Parquet
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: compatibility, timestamp
>
> Parquet files are written with Hive 3.1.x onwards, with the time zone set to US/Pacific.
> {code:sql}
> CREATE TABLE employee (eid INT, birth timestamp) STORED AS PARQUET;
> INSERT INTO employee VALUES 
> (1, '1880-01-01 00:00:00'),
> (2, '1884-01-01 00:00:00'),
> (3, '1990-01-01 00:00:00');
> {code}
> The Parquet files are then read with Hive 4.0.0-alpha-1 onwards.
> +Without vectorization+ results are correct.
> {code:sql}
> SELECT * FROM employee;
> {code}
> {noformat}
> 1     1880-01-01 00:00:00
> 2     1884-01-01 00:00:00
> 3     1990-01-01 00:00:00
> {noformat}
> +With vectorization+ some timestamps are shifted.
> {code:sql}
> -- Disable fetch task conversion to force vectorization to kick in
> set hive.fetch.task.conversion=none;
> SELECT * FROM employee;
> {code}
> {noformat}
> 1     1879-12-31 23:52:58
> 2     1884-01-01 00:00:00
> 3     1990-01-01 00:00:00
> {noformat}
> The problem is the same as the one reported under HIVE-24074. The data were
> written using the new Date/Time APIs (java.time) in Hive 3.1.3, and here they
> were read using the old APIs (java.sql).
> The difference from HIVE-24074 is that here the problem appears only with
> vectorized execution, while the non-vectorized reader works fine, so there
> is an *inconsistency in the behavior* of the vectorized and non-vectorized
> readers.
> The non-vectorized reader works fine because it automatically infers that it
> should use the new JDK APIs to read back the timestamp value. This is
> possible in this case because the file contains metadata (i.e.,
> the presence of {{writer.time.zone}}) from which it can infer that the
> timestamps were written using the new Date/Time APIs.
> The inconsistent behavior between the vectorized and non-vectorized readers is a
> regression caused by HIVE-25104. This JIRA is an attempt to re-align the
> behavior of the two readers.
> Note that if the file metadata are empty, neither the vectorized nor the
> non-vectorized reader can determine which APIs to use for the conversion. In
> this case the user must set
> {{hive.parquet.timestamp.legacy.conversion.enabled}} explicitly to get back
> the correct results.
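
The 7 minute 2 second shift on the 1880 row can be reproduced outside Hive with a small java.time sketch. It is only an illustration, not Hive's reader code: the class and method names are hypothetical, and it assumes that the legacy (java.sql) path effectively applies the fixed -08:00 Pacific standard offset to dates that old, whereas java.time applies the local mean time offset that the tz database records for US/Pacific before November 1883.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical illustration class; not part of Hive.
public class PacificOffsetSketch {

    // Offset that java.time applies to a local date-time in the given zone.
    public static ZoneOffset modernOffset(String zone, LocalDateTime local) {
        return ZoneId.of(zone).getRules().getOffset(local);
    }

    // Wall-clock value obtained when a local date-time is turned into an
    // instant with the `written` offset and decoded again with the `read` offset.
    public static LocalDateTime reinterpret(LocalDateTime local,
                                            ZoneOffset written,
                                            ZoneOffset read) {
        return local.toInstant(written).atOffset(read).toLocalDateTime();
    }

    public static void main(String[] args) {
        // ASSUMPTION: the legacy conversion behaves like a fixed -08:00 offset
        // for these dates. This stands in for the old API; it does not call it.
        ZoneOffset assumedLegacy = ZoneOffset.ofHours(-8);

        for (int year : new int[] {1880, 1884, 1990}) {
            LocalDateTime local = LocalDateTime.of(year, 1, 1, 0, 0);
            ZoneOffset modern = modernOffset("US/Pacific", local);
            System.out.println(local + " written with offset " + modern
                + " reads back as " + reinterpret(local, modern, assumedLegacy));
        }
    }
}
```

Under that assumption only the 1880 value moves, to 1879-12-31T23:52:58, while the 1884 and 1990 values are unchanged, which matches the vectorized output in the description above.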



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
