Hi all,

There was recently a discussion on how to interpret the spec regarding the
"timezone" field of the timestamp type (and on which timestamp-related types
Arrow should have). See
https://lists.apache.org/thread.html/r017084eed74edbc95810fc049056570f45b0bb034d6eeadd647e8621%40%3Cdev.arrow.apache.org%3E
Somewhat related, I want to start a discussion on the extent to which we want
to implement functionality (compute kernels) in Arrow C++ to deal with
timezones.

We just merged a PR that adds kernels to extract fields from timestamps
(year, month, day, hour, etc.; see ARROW-11759
<https://github.com/apache/arrow/pull/10176>). But once you start writing
kernels for timestamp data, you quickly run into the question: what to do
with tz-aware timestamps, i.e. timestamps whose type carries a timezone?

For example, we have:
- ARROW-12980 <https://issues.apache.org/jira/browse/ARROW-12980> about
making the kernels that extract timestamp fields timezone-aware. For
example, if you have a tz-aware timestamp with time "09:30:00+02:00", this is
stored internally as "07:30:00 UTC" (plus the actual timezone as metadata of
the type). For a kernel that extracts the "hour" field, you want it to
return 9 and not 7 (which is what you would get by using the internal UTC
value and ignoring the timezone information).
- ARROW-13033 <https://issues.apache.org/jira/browse/ARROW-13033> (which I
opened today) about adding functionality to convert a tz-naive "local time"
(local "clock" time in a not-yet-specified time zone) to a properly
timezone-aware timestamp with the user-specified time zone attached. This
can be useful to handle data that does not have sufficient timezone
information attached to the data/type itself, but for which you know what
the timezone should be. For example, you may have a timestamp with time
"09:30:00" (no explicit timezone, implicitly UTC) that you know is actually
"09:30:00 CEST"; you then want to convert it to the UTC time ("07:30:00Z")
that is equivalent to "09:30:00 CEST".
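
To make the two examples above concrete, here is a minimal sketch in Python
(not Arrow C++ code). It assumes CEST can be modelled as a fixed +02:00
offset, and the date 2021-06-10 is an arbitrary choice for illustration; an
actual kernel would have to look up the offset per value.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CEST modelled as a fixed +02:00 offset (no timezone database).
CEST = timezone(timedelta(hours=2), "CEST")

# ARROW-12980 case: the stored value is the UTC instant 07:30:00.
stored_utc = datetime(2021, 6, 10, 7, 30, tzinfo=timezone.utc)
hour_ignoring_tz = stored_utc.hour                # uses the raw UTC value
hour_tz_aware = stored_utc.astimezone(CEST).hour  # uses the local wall clock

# ARROW-13033 case: a tz-naive "local clock" value that is known to be CEST.
naive_local = datetime(2021, 6, 10, 9, 30)
localized = naive_local.replace(tzinfo=CEST)      # attach the zone, keep the clock time
as_utc = localized.astimezone(timezone.utc)       # the equivalent UTC instant

print(hour_ignoring_tz, hour_tz_aware)  # 7 9
print(as_utc.isoformat())               # 2021-06-10T07:30:00+00:00
```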

Both kinds of kernel require a conversion between "UTC time" and tz-naive
"local time" (C++ local_t <https://en.cppreference.com/w/cpp/chrono/local_t>),
which requires looking up the offset of the given timezone at that point in
time (the first example requires a conversion from UTC to local time, the
second from local time to UTC).
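
As a rough sketch of what that offset lookup involves, the Python snippet
below uses a hypothetical, hand-written transition table (the 2021 DST
changes of a Central European zone) as a stand-in for a real timezone
database. The names TRANSITIONS, offset_at, utc_to_local and local_to_utc are
made up for this illustration and are not Arrow or date.h APIs.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical toy table: (naive UTC instant, offset in effect from then on).
TRANSITIONS = [
    (datetime(2021, 1, 1, 0, 0), timedelta(hours=1)),    # CET
    (datetime(2021, 3, 28, 1, 0), timedelta(hours=2)),   # CEST starts
    (datetime(2021, 10, 31, 1, 0), timedelta(hours=1)),  # back to CET
]
STARTS = [start for start, _ in TRANSITIONS]


def offset_at(utc):
    """Return the UTC offset in effect at a naive UTC instant."""
    index = bisect_right(STARTS, utc) - 1
    if index < 0:
        raise ValueError("instant precedes the toy table")
    return TRANSITIONS[index][1]


def utc_to_local(utc):
    """UTC -> local: always well defined, one lookup per value."""
    return utc + offset_at(utc)


def local_to_utc(local):
    """Local -> UTC: return all matching UTC instants (0, 1 or 2 of them)."""
    candidates = {local - offset for _, offset in TRANSITIONS}
    return sorted(
        utc for utc in candidates
        if utc >= STARTS[0] and utc_to_local(utc) == local
    )


print(utc_to_local(datetime(2021, 6, 10, 7, 30)))        # 2021-06-10 09:30:00
print(local_to_utc(datetime(2021, 6, 10, 9, 30)))        # one result: 07:30 UTC
print(len(local_to_utc(datetime(2021, 3, 28, 2, 30))))   # 0: skipped by DST
print(len(local_to_utc(datetime(2021, 10, 31, 2, 30))))  # 2: ambiguous
```

The last two lines show why the local-to-UTC direction is the harder one: a
local time can be nonexistent or ambiguous around a DST change, so a kernel
doing that conversion would need some policy for those cases.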

Personally, I think kernels that can handle timezones are important
(if we want users to store tz-aware data in Arrow), but I want to ensure
we are generally OK with expanding the scope of Arrow to actually start
doing something with the tz information of the timestamp type (up to now we
just store that value in the type but never interpret it). That means
dealing with timezone offsets, timezone databases, etc. Luckily, the date.h
library (https://github.com/HowardHinnant/date) that we vendor already
includes all the required functionality.

Best,
Joris
