[ 
https://issues.apache.org/jira/browse/HIVE-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742257#comment-16742257
 ] 

Zoltan Ivanfi commented on HIVE-21002:
--------------------------------------

Hive 3.1 does return "2018-01-01 *00*:00:00.0" for new files, because it writes 
and reads without normalizing to UTC, which is different from what Hive 2.x 
did. This is exactly what causes Hive 3.1 to return "2018-01-01 *08*:00:00.0" 
for a file written by Hive 2.x, because that version normalized the timestamp 
to UTC before writing it. Since there already exist huge amounts of data 
written using Hive 2.x, Hive 3.x should remain capable of reading that existing 
data back correctly.
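
The arithmetic behind these values can be reproduced with a small model. This
is only a sketch of the semantics described above, not Hive's actual code: the
function names write_hive2, read_hive2 and read_hive31 are hypothetical, and
the stored value is modelled as a naive datetime.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def write_hive2(wall_clock: datetime, writer_zone: str) -> datetime:
    """Hypothetical model of a Hive 2.x write: normalize local time to UTC."""
    local = wall_clock.replace(tzinfo=ZoneInfo(writer_zone))
    return local.astimezone(timezone.utc).replace(tzinfo=None)


def read_hive2(stored: datetime, reader_zone: str) -> datetime:
    """Hypothetical model of a Hive 2.x read: convert UTC back to local time."""
    as_utc = stored.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(ZoneInfo(reader_zone)).replace(tzinfo=None)


def read_hive31(stored: datetime) -> datetime:
    """Hypothetical model of a Hive 3.1 read: return the stored value as is."""
    return stored


# Value written by Hive 2.x in America/Los_Angeles (UTC-8 in January).
stored = write_hive2(datetime(2018, 1, 1, 0, 0, 0), "America/Los_Angeles")

print(stored)                                     # 2018-01-01 08:00:00
print(read_hive2(stored, "America/Los_Angeles"))  # 2018-01-01 00:00:00
print(read_hive2(stored, "Europe/Paris"))         # 2018-01-01 09:00:00
print(read_hive31(stored))                        # 2018-01-01 08:00:00
```

Under these assumptions the model yields the same Avro and Parquet values as
the table in the issue description: Hive 3.1 shows the raw UTC-normalized
value regardless of the reader's time zone.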

Even if it were possible to detect the version of Hive that wrote a file, 
adding another workaround based on it would not solve the interoperability 
problem. Users may move data between older and newer Hive versions or have 
other legacy components that read timestamps from Parquet. These older 
applications do not contain the logic needed to handle Hive 3.1 
semantics; they assume that timestamps written by any version of Hive are 
normalized to UTC.
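
The reverse direction can be sketched the same way. The example below is an
illustration under stated assumptions, not the behaviour of any specific
component: write_without_normalizing stands in for a Hive 3.1 write and
legacy_read_assuming_utc for an older reader that treats every stored
timestamp as UTC. Both names are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def write_without_normalizing(wall_clock: datetime) -> datetime:
    """Hypothetical model of a Hive 3.1 write: store the wall clock unchanged."""
    return wall_clock


def legacy_read_assuming_utc(stored: datetime, reader_zone: str) -> datetime:
    """Hypothetical legacy reader: treat the stored value as UTC, show local time."""
    as_utc = stored.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(ZoneInfo(reader_zone)).replace(tzinfo=None)


stored = write_without_normalizing(datetime(2018, 1, 1, 0, 0, 0))

in_la = legacy_read_assuming_utc(stored, "America/Los_Angeles")
in_paris = legacy_read_assuming_utc(stored, "Europe/Paris")

print(in_la)     # 2017-12-31 16:00:00
print(in_paris)  # 2018-01-01 01:00:00
```

In this model a legacy reader shifts the value by its own UTC offset, so it
never sees the "2018-01-01 00:00:00" that was written, and a writer-version
check inside Hive 3.x cannot correct that.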

> Backwards incompatible change: Hive 3.1 reads back Avro and Parquet 
> timestamps written by Hive 2.x incorrectly
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21002
>                 URL: https://issues.apache.org/jira/browse/HIVE-21002
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.0, 3.1.1
>            Reporter: Zoltan Ivanfi
>            Priority: Major
>
> Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x 
> incorrectly. As an example session to demonstrate this problem, create a 
> dataset using Hive version 2.x in America/Los_Angeles:
> {code:sql}
> hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
> hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
> {code}
> Querying this table by issuing
> {code:sql}
> hive> select * from ts_‹format›;
> {code}
> from different time zones using different versions of Hive and different 
> storage formats gives the following results:
> |‹format›|Time zone|Hive 2.x|Hive 3.1|
> |Avro and Parquet|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
> |Avro and Parquet|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
> |Textfile and ORC|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> |Textfile and ORC|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> *Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored 
> in Avro and Parquet formats.* Apache ORC behaviour has not changed because it 
> was modified to adjust timestamps to retain backwards compatibility. Textfile 
> behaviour has not changed because its processing involves parsing and 
> formatting instead of proper serializing and deserializing, so it 
> inherently had LocalDateTime semantics even in Hive 2.x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
