[ https://issues.apache.org/jira/browse/HIVE-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131453#comment-15131453 ]

Ryan Blue commented on HIVE-12767:
----------------------------------

[~spena], the q tests from the last patch look great and make me confident 
that the create table and property behavior are correct. I still think the 
time zone conversion doesn't implement what it should (see below). I also 
found a few other things:

# Minor: The CTAS test says the zone property on the resulting table should 
be PST, but it is (and should be) UTC. I think the comment is wrong and the 
behavior is correct.
# It would be nice to validate that the table property always overrides the 
global option.
# Is there a test that files are created with the table property in the 
key/value metadata? I think the copy from one table to another with a 
different property set wouldn't work otherwise, but I want to make sure (see 
the sketch just after this list).
# What is the Hive 2.0 test doing? It looks like it is validating that setting 
the table property to UTC converts values back to local time? Or was the file 
written with UTC? (If so, we should have one written in another zone.)
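
For #3, if such a test doesn't exist yet, something along these lines could 
check the footer directly (a sketch only: the path is made up, and I'm 
assuming the metadata key matches the table property name):

{code:java}
import static org.junit.Assert.assertEquals;

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.junit.Test;

public class TestZoneInFooter {
  @Test
  public void writerPersistsZoneInFooter() throws Exception {
    // Read back the footer of a file written by the patched writer and
    // verify that the zone was persisted in the key/value metadata.
    Configuration conf = new Configuration();
    Map<String, String> keyValues = ParquetFileReader
        .readFooter(conf, new Path("/warehouse/tbl/000000_0")) // path made up
        .getFileMetaData()
        .getKeyValueMetaData();
    // Assuming the key is the table property name; use whatever the patch writes.
    assertEquals("UTC", keyValues.get("parquet.mr.int96.write.zone"));
  }
}
{code}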

I think the Hive 2.0 test demonstrates that the time zone conversion isn't 
correct. It handles the with-conversion and without-conversion cases 
correctly, but it isn't a correct implementation of what the table 
property's time zone is supposed to mean. In the Hive 2.0 test the property 
is set to UTC, but the value is read back in local time (if I'm reading this 
correctly).

Here's the problem: the back-end converts julian/nanos to year/month/day/hour 
values, then sets those on a calendar. It then uses the backing time in 
milliseconds to create a Timestamp. That timestamp is in _local_ time because 
of HIVE-12192, so to have the time _not_ adjusted, the calendar needs to also 
use the local zone. So using the local zone ends up not adjusting the value.
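
Here is a minimal sketch of those mechanics (illustrative values and zone, 
not the patch's actual code):

{code:java}
import java.sql.Timestamp;
import java.util.Calendar;
import java.util.TimeZone;

public class LocalZoneRoundTrip {
  public static void main(String[] args) {
    // The back-end sets the year/month/day/... fields computed from the
    // int96 value on a Calendar, then builds a Timestamp from the backing
    // millis.
    Calendar cal = Calendar.getInstance(TimeZone.getDefault()); // local zone
    cal.clear();
    cal.set(2016, Calendar.FEBRUARY, 3, 11, 30, 0);

    // java.sql.Timestamp always renders in the JVM's default (local) zone,
    // which is the HIVE-12192 behavior described above. The fields
    // round-trip unchanged only because the Calendar's zone is also the
    // local zone; any other Calendar zone would shift the rendered value by
    // its offset.
    Timestamp ts = new Timestamp(cal.getTimeInMillis());
    System.out.println(ts); // 2016-02-03 11:30:00.0
  }
}
{code}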

The problem is that the current patch uses this implementation to adjust 
values to the table property zone. That doesn't work because the "no offset" 
zone for the table property is UTC, not the local zone. I'm attaching a file 
that demonstrates the problem. It uses [Spark's implementation of 
fromJulianDay|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L180] 
as a zone-independent reference. I chose a date/time at random and tested 
the two implementations.
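
For reference, the arithmetic in that Spark method boils down to the 
following (my Java paraphrase of the linked Scala, wrapped in a hypothetical 
{{JulianDay}} class; note that no time zone appears anywhere, so the result 
is an absolute, effectively UTC-based instant):

{code:java}
public class JulianDay {
  // Julian day number of 1970-01-01, the Unix epoch.
  static final long JULIAN_DAY_OF_EPOCH = 2440588L;
  static final long SECONDS_PER_DAY = 24L * 60L * 60L;
  static final long MICROS_PER_SECOND = 1000L * 1000L;

  // Julian day plus nanos-of-day to microseconds since the Unix epoch.
  static long fromJulianDay(int julianDay, long nanosOfDay) {
    long seconds = (julianDay - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY;
    return seconds * MICROS_PER_SECOND + nanosOfDay / 1000L;
  }
}
{code}

A zone only enters the picture when that instant is rendered as 
year/month/day fields, which is exactly why the identity zone for the table 
property has to be UTC and not the local zone.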

Tests that show _correct_ behavior:
* {{testKnownTimestampWithoutConversion}} shows that when UTC is used, the 
fromJulianDay value is the same as the "skip conversion" value returned by 
NanoTimeUtils.
* {{testKnownTimestampWithConversion}} shows that when the local zone is used 
(for me, PST), the fromJulianDay value is the same as the "do not skip 
conversion" value returned by NanoTimeUtils.

Tests that show _incorrect_ behavior (see the sketch after this list):
* {{testKnownWithZoneArgumentUTC}} shows that passing a UTC calendar to 
NanoTimeUtils does not match the fromJulianDay/UTC value that matched the "skip 
conversion" in tests above. This doesn't pass, but should.
* {{testKnownWithWrongZone}} shows that passing a PST calendar to NanoTimeUtils 
matches the fromJulianDay/UTC value that corresponds to "skip conversion". This 
shouldn't pass, but does.
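
In outline, the two failing expectations look like this (the attached file 
has the real tests; the {{Calendar}}-taking {{getTimestamp}} overload is my 
shorthand for however the patch threads the zone through, and 
{{JulianDay.fromJulianDay}} is the sketch above):

{code:java}
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

import org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTime;
import org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTimeUtils;

public class ZoneArgumentDemo {
  public static void main(String[] args) {
    // A known int96 value: julian day plus nanos-of-day
    // (2016-02-07 11:30:00 UTC, picked at random).
    NanoTime nt = new NanoTime(2457426, 41400000000000L);

    // Zone-free reference instant, rendered in UTC.
    long expectedMicros = JulianDay.fromJulianDay(2457426, 41400000000000L);
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    String reference = fmt.format(new Date(expectedMicros / 1000L));

    // Passing a UTC calendar should reproduce the reference (the "skip
    // conversion" value), but with the current patch it does not...
    Calendar utc = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
    Timestamp fromUtc = NanoTimeUtils.getTimestamp(nt, utc); // overload assumed
    System.out.println(fromUtc.toString().startsWith(reference)); // false, should be true

    // ...while passing the local zone (PST for me) reproduces it instead.
    Calendar pst = Calendar.getInstance(TimeZone.getTimeZone("America/Los_Angeles"));
    Timestamp fromPst = NanoTimeUtils.getTimestamp(nt, pst);
    System.out.println(fromPst.toString().startsWith(reference)); // true, but shouldn't be
  }
}
{code}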

I'm attaching my test to this issue and can help if you have any questions 
about it.

> Implement table property to address Parquet int96 timestamp bug
> ---------------------------------------------------------------
>
>                 Key: HIVE-12767
>                 URL: https://issues.apache.org/jira/browse/HIVE-12767
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.1, 2.0.0
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-12767.3.patch, HIVE-12767.4.patch
>
>
> Parquet timestamps using INT96 are not compatible with other tools, like 
> Impala, because Hive adjusts timezone values in a different way than 
> Impala.
> To address these issues, a new table property (parquet.mr.int96.write.zone) 
> will be used in Hive to determine which timezone to use when writing and 
> reading timestamps from Parquet.
> The following is the exit criteria for the fix:
> * Hive will read Parquet MR int96 timestamp data and adjust values using a 
> time zone from a table property, if set, or using the local time zone if it 
> is absent. No adjustment will be applied to data written by Impala.
> * Hive will write Parquet int96 timestamps using a time zone adjustment from 
> the same table property, if set, or using the local time zone if it is 
> absent. This keeps the data in the table consistent.
> * New tables created by Hive will set the table property to UTC if the global 
> option to set the property for new tables is enabled.
> ** Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set 
> the property unless the global setting to do so is enabled.
> ** Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the 
> property of the table that is copied.


