[jira] [Work logged] (HIVE-24693) Parquet Timestamp Values Read/Write Very Slow

ASF GitHub Bot (Jira) Thu, 04 Feb 2021 07:34:06 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-24693?focusedWorklogId=547692&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547692
 ]


ASF GitHub Bot logged work on HIVE-24693:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 04/Feb/21 15:33
            Start Date: 04/Feb/21 15:33
    Worklog Time Spent: 10m 
      Work Description: belugabehr commented on pull request #1938:
URL: https://github.com/apache/hive/pull/1938#issuecomment-773396681


   Before this PR, Hive was not parsing negative dates correctly.  Now it has 
to handle them, because the "parse" code has been removed and it now just uses 
the raw values (which are negative numbers).  I changed the formatters to 
accept negative years (and year zero).
   
   I had to update the QTests unfortunately because the 'mask' UDF was assuming 
that the year 0001 was the first valid year.  The function of the 'mask' method 
is to transform the year (for example) 2021 -> 0000 in order to "mask" it.  
However, since it did not respect/understand/support year '0000', the year is 
displayed as 0001.  This is a bit confusing from a user standpoint.  Anyway, 
Hive should be able to support a year 0000, and these code changes require it, 
so I have updated the qtests to expect a year of value 0000.
   
   https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
   >The range of values supported for the Date type is 0000-01-01 to 
9999-12-31
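
To illustrate the year-0000 point above, here is a minimal sketch using only java.time. It is not the formatter code from the PR, and the class name YearZeroSketch and its helper methods are hypothetical. It shows that the pattern letter 'u' (proleptic year) accepts year zero and negative years, while 'y' (year-of-era) renders proleptic year 0 as 0001. Whether this is what the 'mask' UDF ran into is an assumption, not something stated in the comment.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; not taken from the Hive code base.
final class YearZeroSketch {
  // 'u' is the proleptic year, which may be zero or negative.
  static final DateTimeFormatter PROLEPTIC =
      DateTimeFormatter.ofPattern("uuuu-MM-dd");
  // 'y' is the year-of-era, which is always >= 1.
  static final DateTimeFormatter YEAR_OF_ERA =
      DateTimeFormatter.ofPattern("yyyy-MM-dd");

  static String formatProleptic(LocalDate d) {
    return PROLEPTIC.format(d);
  }

  static String formatYearOfEra(LocalDate d) {
    return YEAR_OF_ERA.format(d);
  }

  static LocalDate parseProleptic(String s) {
    return LocalDate.parse(s, PROLEPTIC);
  }

  public static void main(String[] args) {
    LocalDate yearZero = LocalDate.of(0, 1, 1);
    System.out.println(formatProleptic(yearZero));   // 0000-01-01
    System.out.println(formatYearOfEra(yearZero));   // 0001-01-01
    System.out.println(parseProleptic("0000-01-01").getYear());
  }
}
```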


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 547692)
    Time Spent: 1h 50m  (was: 1h 40m)

> Parquet Timestamp Values Read/Write Very Slow
> ---------------------------------------------
>
>                 Key: HIVE-24693
>                 URL: https://issues.apache.org/jira/browse/HIVE-24693
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Parquet {{DataWriteableWriter}} relies on {{NanoTimeUtils}} to convert a 
> timestamp object into a binary value.  It does this by calling 
> {{toString()}} on the timestamp object and then parsing the String.  
> This particular timestamp does not carry a time zone, so the string is 
> something like:
> {{2021-21-03 12:32:23.0000...}}
> The parse code tries to parse the string assuming there is a time zone, and 
> if not, falls back and applies the provided "default time zone".  As was 
> noted in [HIVE-24353], if something fails to parse, it is very expensive to 
> try to parse again.  So, for each timestamp in the Parquet file, it:
> * Builds a string from the timestamp
> * Parses it (throws an exception, parses again)
> There is no need for this kind of string manipulation/parsing; it should 
> just use the epoch millis/seconds/time stored internally in the Timestamp 
> object.
> {code:java}
>   // Converts Timestamp to TimestampTZ.
>   public static TimestampTZ convert(Timestamp ts, ZoneId defaultTimeZone) {
>     return parse(ts.toString(), defaultTimeZone);
>   }
> {code}
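
The direction suggested in the description (use the stored epoch values instead of a string round trip) can be sketched as follows. This is an assumption-laden illustration, not the actual patch: java.time.LocalDateTime and ZonedDateTime stand in for Hive's Timestamp and TimestampTZ, and the class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; java.time types stand in for Hive's classes.
final class TimestampConvertSketch {
  private static final DateTimeFormatter FMT =
      DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSSSSSSSS");

  // String round trip, similar in spirit to the code quoted above:
  // render the timestamp as text, then parse the text back.
  static ZonedDateTime viaString(LocalDateTime ts, ZoneId defaultTimeZone) {
    String text = ts.format(FMT);
    return LocalDateTime.parse(text, FMT).atZone(defaultTimeZone);
  }

  // Direct path: rebuild the zone-less wall-clock value from epoch
  // seconds and nanos, then attach the default zone. No text involved.
  static ZonedDateTime direct(long epochSecond, int nanos, ZoneId defaultTimeZone) {
    return LocalDateTime.ofEpochSecond(epochSecond, nanos, ZoneOffset.UTC)
        .atZone(defaultTimeZone);
  }

  public static void main(String[] args) {
    ZoneId zone = ZoneId.of("America/Los_Angeles");
    LocalDateTime ts = LocalDateTime.of(2021, 3, 21, 12, 32, 23);
    long seconds = ts.toEpochSecond(ZoneOffset.UTC);
    // Both paths should yield the same zoned value.
    System.out.println(viaString(ts, zone).equals(direct(seconds, 0, zone)));
  }
}
```

Both methods interpret the zone-less wall-clock fields in the supplied default zone; the direct path simply skips formatting and parsing, and it handles negative epoch values (dates before 1970) without special treatment.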



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
