To take a step back and focus on some concrete issues:

Parquet has two timestamp types: with time zone (UTC-normalized) and without time zone (non-UTC-normalized):
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L268
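Concretely, the difference between the two semantics can be sketched with the Python standard library (a minimal illustration; the 2021-06-15 value and the UTC-4 offset are arbitrary examples, not taken from any of the systems discussed here):

```python
from datetime import datetime, timedelta, timezone

# WITHOUT TIME ZONE: a wall-clock reading; by itself it names no instant.
wall = datetime(2021, 6, 15, 9, 0, 0)

# WITH TIME ZONE (UTC-normalized): the same reading pinned to one instant,
# here by assuming the clock was read at UTC-4 (an arbitrary example offset).
instant = wall.replace(tzinfo=timezone(timedelta(hours=-4)))
utc_normalized = instant.astimezone(timezone.utc)

print(wall.isoformat())            # wall-clock value, no instant implied
print(utc_normalized.isoformat())  # one fixed point on the UTC timeline
```

Going from the first representation to the second requires choosing a zone (which is what pandas's tz_localize does); that choice is exactly what mandatory UTC normalization would force on systems whose data carries no zone.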
The SQL standard (e.g. PostgreSQL) has two timestamp types: with/without time zone; in some SQL implementations each slot can have a different time zone.
https://www.postgresql.org/docs/9.1/datatype-datetime.html

* WITHOUT TIME ZONE: a "timestamp without time zone value should be taken or given as timezone local time"

The Spark / Databricks documentation discusses how Spark handles this:
https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps

* WITHOUT TIME ZONE: "These timestamps are not bound to any time zone, and are wall clock timestamps." (not UTC-normalized)
* WITH TIME ZONE: "does not affect the physical point in time that the timestamp represents, as that is fully represented by the UTC time instant given by the other timestamp components"

pandas, as discussed, has non-UTC-normalized WITHOUT TIME ZONE ("naive") timestamps and UTC-normalized WITH TIME ZONE timestamps.

If we were to change Arrow's WITHOUT TIME ZONE semantics to be interpreted as UTC-normalized, that would force all of these other systems (and more) to serialize their data as UTC-normalized (i.e. to call the equivalent of pandas's tz_localize function) when they convert to Arrow. This seems very harmful to me: it would make data from these systems not accurately representable in Arrow and unable to be round-tripped. Perhaps we can make a spreadsheet and look comprehensively at how many use cases would be disenfranchised by always requiring UTC normalization.

On Tue, Jun 15, 2021 at 3:16 PM Adam Hooper <a...@adamhooper.com> wrote:
>
> On Tue, Jun 15, 2021 at 1:19 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > Arrow's "Timestamp with Timezone" can have fields extracted
> > from it.
> >
>
> Sure, one *can* extract fields from timestamp+tz.
> But I don't feel timestamp+tz is *designed* for extracting fields:
>
> - Extracting fields from int64+tz is inefficient, because it bundles two
> steps: 1) convert to a datetime struct; and 2) return one field from the
> datetime struct. (If I want to extract Year, Month, and Day, is that three
> function calls that *each* convert to a datetime struct?)
> - Extracting fields from int64+tz is awkward, because it's not obvious
> which time zone is being used. (To extract fields in a custom time zone,
> must I 1) clone the column with a new time zone; and 2) call the function?)
>
> My understanding of "best practice" for extracting multiple fields from
> Arrow's timestamp columns is:
>
> 1. Convert from a timestamp column to date32 and/or time32/time64 columns
> in one pass (one of three operations, perhaps: timestamp=>date32,
> timestamp=>time64, or timestamp=>struct{date32,time64}).
> 2. Extract fields from those date32 and time64 columns.
>
> Only step 1 needs a time zone. In C, the analogue is localtime().
>
> We do step 1 at Workbench -- see converttimestamptodate
> <https://github.com/CJWorkbench/converttimestamptodate/blob/main/converttimestamptodate.py>
> for our implementation. We haven't had much demand for step 2, so we'll
> get to it later.
>
> I think of this "best practice" as a compromise:
>
> - date32+time64 aren't as time-efficient as C's struct tm, but together
> they use 12 bytes, whereas the C struct costs 50-100 bytes.
> - date32+time64 are 50% less space-efficient than int64, but they're
> intuitive and they save time.
>
> A small benchmark to back up that "save time" assertion, in Python:
>
> >>> import datetime, os, time, timeit
> >>> os.environ['TZ'] = 'America/Montreal'
> >>> time.tzset()
> >>> timestamp = time.time()
> >>> timeit.timeit(lambda: datetime.date.fromtimestamp(timestamp).year)
> 0.2955563920113491
> >>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline:
> timeit overhead + tuple construction
> 0.2509278700017603
>
> Most of the test is overhead, but certainly the timestamp=>date
> conversion takes time, and it's sane to try to minimize that overhead.
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com
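Adam's two-step pattern (one time-zone-aware conversion, then cheap field reads) can be sketched in plain Python. The split_timestamp helper, the epoch value, and the fixed UTC-4 offset below are all illustrative assumptions, not Arrow or Workbench API:

```python
from datetime import datetime, timedelta, timezone

def split_timestamp(epoch_seconds, tz):
    """Step 1: one time-zone-aware conversion, yielding a (date, time) pair,
    the moral equivalent of timestamp => struct{date32, time64}."""
    dt = datetime.fromtimestamp(epoch_seconds, tz)
    return dt.date(), dt.time()

tz = timezone(timedelta(hours=-4))         # illustrative fixed offset
d, t = split_timestamp(1_623_769_200, tz)  # 2021-06-15T15:00:00Z

# Step 2: field extraction needs no time zone and no further conversion,
# only attribute reads on the already-split date and time values.
print(d.year, d.month, d.day)
print(t.hour, t.minute)
```

The time zone is consumed once, in step 1; every field read in step 2 is then a plain lookup, which is the efficiency argument made above.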