Re: [Format] Timestamp timezone semantics?

Julian Hyde Fri, 04 Jun 2021 10:46:48 -0700

The lesson there is: library software shouldn’t use anything from its 
environment (time zone, locale, encoding, endianness). Functions that use time 
zone should always have a time zone parameter.
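
A minimal sketch of that rule in Python, using only the standard library. The
function name hour_of and the fixed-offset zone standing in for US/Central
daylight time are assumptions made for illustration; neither comes from Arrow.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def hour_of(epoch_seconds: int, tz: tzinfo) -> int:
    """Hour of day for an instant, in the zone the caller names explicitly."""
    # tz is a required argument, so the result never depends on the host's TZ setting.
    return datetime.fromtimestamp(epoch_seconds, tz).hour


# Assumption: a fixed UTC-5 offset is enough for this one date (no DST rules).
CDT = timezone(timedelta(hours=-5), "CDT")

print(hour_of(1622748048, timezone.utc))  # 19
print(hour_of(1622748048, CDT))           # 14
```

The same call gives the same answer on any machine, because the zone is part
of the call rather than part of the environment.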


Once you take that step, the functions that work with zoneless timestamps start 
to look different to functions that work with local timestamps, and you start 
to realize that they should be separate data types.
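
One way that separation might look, as a sketch rather than a proposal for
Arrow. The class names Instant and ZonelessTimestamp, the method at_zone, and
the reading that one type is a fixed point on the UTC timeline while the other
is a wall-clock value with no zone are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone, tzinfo

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class Instant:
    """A point on the global timeline: microseconds since the Unix epoch (UTC)."""
    epoch_micros: int


@dataclass(frozen=True)
class ZonelessTimestamp:
    """A wall-clock reading with no zone attached, stored as a naive datetime."""
    wall: datetime

    def at_zone(self, tz: tzinfo) -> Instant:
        # The only route to an Instant is to name a zone explicitly.
        delta = self.wall.replace(tzinfo=tz) - EPOCH
        return Instant(delta // timedelta(microseconds=1))


def micros_between(a: Instant, b: Instant) -> int:
    """Elapsed microseconds; only defined for Instants."""
    return b.epoch_micros - a.epoch_micros


CDT = timezone(timedelta(hours=-5))  # fixed offset, for illustration only
start = Instant(1622748048 * 1_000_000)
zoneless = ZonelessTimestamp(datetime(2021, 6, 3, 19, 20, 48))
print(micros_between(start, zoneless.at_zone(CDT)))  # 18000000000
```

With a static type checker, passing a ZonelessTimestamp straight to
micros_between is flagged before the code runs, which is the kind of user
error a single shared type cannot catch.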

> On Jun 3, 2021, at 12:26 PM, Wes McKinney <[email protected]> wrote:
> 
> Arrow's decision was not to permit storage of timestamps with
> "localized" representation (which is distinct from UTC internal
> representation with a different time zone set). The problem really
> comes down to the interpretation of "time zone naive" timestamps on
> different systems: operations in my opinion should not yield different
> results depending on the particular locale of the system where the
> operations are being run.
> 
> date on my Linux system returns 1622748048, which is 19:21 UTC. If you
> encounter 1622748048 without any given time zone, and want to
> interpret 1622748048 as CDT (US/Central where I live), then Arrow is
> asking you to localize that timestamp to the UTC representation of
> 19:21 CDT, which is 5 hours later, so you need to add 5 hours' worth of
> seconds (18000) to the timestamp to adjust it to UTC.
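
The localization step described here can be checked with a few lines of
standard-library Python. This is a sketch that assumes CDT is a fixed UTC-5
offset; the variable names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CDT modelled as a fixed UTC-5 offset, with no DST rules.
CDT = timezone(timedelta(hours=-5), "CDT")

raw = 1622748048
# Read the raw value as a zoneless wall-clock time.
wall = datetime(1970, 1, 1) + timedelta(seconds=raw)
# Declare that the reading was taken in CDT, then convert to a UTC epoch value.
localized = int(wall.replace(tzinfo=CDT).timestamp())

print(wall)             # 2021-06-03 19:20:48
print(localized - raw)  # 18000
```

The difference is the zone's offset from UTC in seconds, which is the
adjustment needed to store the value with UTC semantics.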
> 
> In some systems, if you encounter 1622748048 without time zone
> indicated, the behavior of timestamp_day() or timestamp_hour() will
> depend on the system locale. We are recommending that the behavior of
> these functions should consistently have the UTC interpretation of the
> value rather than using the system locale. This is what Python does
> with "tz-naive" datetime.datetime objects: if you access
> datetime.hour on a timezone-less datetime.datetime, it will return the
> same result no matter where in the world you are.
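
A small sketch of that Python behavior. It changes the process time zone with
time.tzset(), which is available on Unix only, and the POSIX-style TZ strings
are assumptions chosen so that no zone database is needed.

```python
import os
import time
from datetime import datetime

naive = datetime(2021, 6, 3, 19, 20, 48)  # tz-naive: no zone attached
raw = 1622748048

naive_hours = []
host_dependent_hours = []
for tz_string in ("UTC0", "CDT5", "JST-9"):  # POSIX TZ syntax: name + hours west of UTC
    os.environ["TZ"] = tz_string
    time.tzset()  # Unix only
    # A field read on a naive datetime ignores the process time zone.
    naive_hours.append(naive.hour)
    # Converting a raw epoch value without naming a zone does depend on it.
    host_dependent_hours.append(datetime.fromtimestamp(raw).hour)

print(naive_hours)           # [19, 19, 19]
print(host_dependent_hours)  # varies with the TZ setting
```

The second list is the behavior being argued against: the same integer yields
a different hour depending on where the code runs.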
> 
> On Thu, Jun 3, 2021 at 1:19 PM Julian Hyde <[email protected]> wrote:
>> 
>> It seems that Arrow’s timestamp type can either have no time zone or be UTC. 
>> I think that is a flawed design, because it doesn’t catch user errors.
>> 
>> Suppose you want to find the number of milliseconds between two timestamps. 
>> If the first has a timezone and the second is implicitly UTC, then you can 
>> convert them both to instants and subtract. But if the first has a timezone 
>> and the second has no time zone, you must supply a time zone for the second. 
>> So, the subtraction function will have a different signature.
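
The two signatures can be sketched with Python's standard datetime type, which
already refuses to subtract a zone-aware value from a naive one. The function
names below are hypothetical and CDT is assumed to be a fixed UTC-5 offset.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def millis_between(a: datetime, b: datetime) -> int:
    """Both arguments must be zone-aware, so each denotes a single instant."""
    return (b - a) // timedelta(milliseconds=1)


def millis_between_zoneless(a: datetime, b_naive: datetime, b_tz: tzinfo) -> int:
    """Different signature: the caller must say which zone the naive value is in."""
    return millis_between(a, b_naive.replace(tzinfo=b_tz))


CDT = timezone(timedelta(hours=-5), "CDT")  # fixed offset, for illustration only
aware = datetime(2021, 6, 3, 14, 20, 48, tzinfo=CDT)
utc = datetime(2021, 6, 3, 19, 20, 48, tzinfo=timezone.utc)
naive = datetime(2021, 6, 3, 19, 20, 48)

print(millis_between(aware, utc))  # 0
try:
    millis_between(aware, naive)   # mixing aware and naive values
except TypeError as exc:
    print("rejected:", exc)
print(millis_between_zoneless(aware, naive, CDT))  # 18000000
```

Here the mix-up is only caught at run time; with separate data types it could
be rejected before the subtraction is ever attempted.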
>> 
>> There are many similar operations, where a time zone needs to be supplied, 
>> or where you cannot safely mix timestamps with different time zones.
>> 
>> Julian
>> 
>> 
>>> On Jun 3, 2021, at 11:07 AM, Adam Hooper <[email protected]> wrote:
>>> 
>>> On Thu, Jun 3, 2021 at 2:02 PM Adam Hooper <[email protected]> wrote:
>>> 
>>>> I understand isAdjustedToUTC=true to mean "timestamp", and
>>>> isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
>>>> docs because
>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
>>>> lists a whole slew of potential meanings and without extra metadata I'll
>>>> never be able to figure out what this column means."
>>>> 
>>> 
>>> Correcting myself here: Parquet isAdjustedToUTC=false does have just one
>>> meaning. It means encoding a "(year, month, day, hour, minute, second,
>>> microsecond)" tuple as a single integer.
>>> 
>>> Adam
>>> 
>>> --
>>> Adam Hooper
>>> +1-514-882-9694
>>> http://adamhooper.com
>>
