Zoltan Ivanfi created HIVE-21290:
------------------------------------
Summary: Restore historical way of handling timestamps in Parquet
while keeping the new semantics at the same time
Key: HIVE-21290
URL: https://issues.apache.org/jira/browse/HIVE-21290
Project: Hive
Issue Type: Sub-task
Reporter: Zoltan Ivanfi
This sub-task is for implementing the Parquet-specific parts of the following
plan:
h1. Problem
Historically, the semantics of the TIMESTAMP type in Hive depended on the file
format. Timestamps in Avro, Parquet and RCFiles with a binary SerDe had
_Instant_ semantics, while timestamps in ORC, textfiles and RCFiles with a text
SerDe had _LocalDateTime_ semantics.
The Hive community wanted to get rid of this inconsistency and have
_LocalDateTime_ semantics in Avro, Parquet and RCFiles with a binary SerDe as
well. *Hive 3.1 turned off normalization to UTC* to achieve this. While this
leads to the desired new semantics, it also leads to incorrect results in two
cases: when new Hive versions read timestamps written by old Hive versions,
and when old Hive versions or any other component unaware of this change
(including legacy Impala and Spark versions) read timestamps written by new
Hive versions.
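The mismatch can be illustrated with a minimal sketch. It is not Hive code: the function names, the sample date and the fixed UTC+01:00 offset are assumptions chosen only for illustration, and a naive datetime stands in for the value stored in the file.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical session zone (fixed offset, so DST is ignored in this sketch).
SESSION_ZONE = timezone(timedelta(hours=1))


def old_write(wall_clock: datetime, zone: timezone) -> datetime:
    """Old writer: normalize the wall-clock value to UTC before storing."""
    return wall_clock.replace(tzinfo=zone).astimezone(timezone.utc).replace(tzinfo=None)


def old_read(stored: datetime, zone: timezone) -> datetime:
    """Old reader: treat the stored value as UTC and convert to local time."""
    return stored.replace(tzinfo=timezone.utc).astimezone(zone).replace(tzinfo=None)


def new_write(wall_clock: datetime) -> datetime:
    """Hive 3.1 style writer: store the wall-clock value without normalization."""
    return wall_clock


def new_read(stored: datetime) -> datetime:
    """Hive 3.1 style reader: return the stored value without conversion."""
    return stored


written = datetime(2019, 2, 19, 12, 0, 0)

# Old writer, new reader: the UTC-normalized value is shown unconverted.
new_reads_old = new_read(old_write(written, SESSION_ZONE))  # 11:00, not 12:00

# New writer, old reader: the unnormalized value is shifted as if it were UTC.
old_reads_new = old_read(new_write(written), SESSION_ZONE)  # 13:00, not 12:00
```

In both mixed cases the value read back differs from the value written by the session's UTC offset, even though writer and reader use the same time zone.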
h1. Solution
To work around this issue, Hive *should restore the practice of normalizing to
UTC* when writing timestamps to Avro, Parquet and RCFiles with a binary SerDe.
In itself, this would restore the historical _Instant_ semantics, which is
undesirable. In order to achieve the desired _LocalDateTime_ semantics in spite
of normalizing to UTC, newer Hive versions should record the session-local time
zone in the file's metadata, using the fields intended for arbitrary key-value
storage.
When reading back files with this time zone metadata, newer Hive versions (or
any other new component aware of this extra metadata) can achieve
_LocalDateTime_ semantics by *converting from UTC to the saved time zone
(instead of to the local time zone)*. Legacy components that are unaware of the
new metadata can still read the files without any problem, and the timestamps
will show the historical _Instant_ behaviour to them.
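The proposed write and read paths can be sketched as follows. This is only an illustration under stated assumptions, not the actual implementation: the metadata key name is hypothetical, and the zone is recorded as a fixed UTC offset in minutes to keep the example self-contained, whereas a real implementation would more likely store a time zone ID.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical metadata key; the real key name is not specified in this issue.
ZONE_KEY = "writer.time.zone.offset.minutes"


def write_timestamp(wall_clock: datetime, session_zone: timezone):
    """Normalize to UTC and record the session zone in key-value metadata."""
    utc_value = (
        wall_clock.replace(tzinfo=session_zone)
        .astimezone(timezone.utc)
        .replace(tzinfo=None)
    )
    offset_minutes = int(session_zone.utcoffset(None).total_seconds() // 60)
    return utc_value, {ZONE_KEY: str(offset_minutes)}


def read_with_metadata(utc_value: datetime, metadata: dict) -> datetime:
    """New reader: convert from UTC to the saved zone, not the local zone."""
    saved_zone = timezone(timedelta(minutes=int(metadata[ZONE_KEY])))
    return utc_value.replace(tzinfo=timezone.utc).astimezone(saved_zone).replace(tzinfo=None)


def read_legacy(utc_value: datetime, local_zone: timezone) -> datetime:
    """Legacy reader: ignore the metadata and convert from UTC to local time."""
    return utc_value.replace(tzinfo=timezone.utc).astimezone(local_zone).replace(tzinfo=None)


writer_zone = timezone(timedelta(hours=1))   # hypothetical writer session zone
reader_zone = timezone(timedelta(hours=-8))  # hypothetical legacy reader zone

written = datetime(2019, 2, 19, 12, 0, 0)
stored, meta = write_timestamp(written, writer_zone)

# LocalDateTime semantics: the same wall-clock value wherever it is read.
local_date_time_view = read_with_metadata(stored, meta)

# Instant semantics: the same instant, shown in the legacy reader's own zone.
instant_view = read_legacy(stored, reader_zone)
```

Here the stored value is 11:00 UTC. A metadata-aware reader recovers 12:00 regardless of its own zone, while the legacy reader at UTC-08:00 sees 03:00, which is the same instant.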