To take a step back and focus on some concrete issues:

Parquet has two timestamp types: with time zone (UTC-normalized) and without time zone (non-UTC-normalized):
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L268
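Concretely, the difference between the two semantics can be sketched with the Python standard library (a minimal illustration; the 2021-06-15 value and the UTC-4 offset are arbitrary examples, not taken from any of the systems discussed here):

```python
from datetime import datetime, timedelta, timezone

# WITHOUT TIME ZONE: a wall-clock reading; by itself it names no instant.
wall = datetime(2021, 6, 15, 9, 0, 0)

# WITH TIME ZONE (UTC-normalized): the same reading pinned to one instant,
# here by assuming the clock was read at UTC-4 (an arbitrary example offset).
instant = wall.replace(tzinfo=timezone(timedelta(hours=-4)))
utc_normalized = instant.astimezone(timezone.utc)

print(wall.isoformat())            # wall-clock value, no instant implied
print(utc_normalized.isoformat())  # one fixed point on the UTC timeline
```

Going from the first representation to the second requires choosing a zone (which is what pandas's tz_localize does); that choice is exactly what mandatory UTC normalization would force on systems whose data carries no zone.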
The SQL standard (e.g. PostgreSQL) has two timestamp types: with/without time zone; in some SQL implementations each slot can have a different time zone.
https://www.postgresql.org/docs/9.1/datatype-datetime.html

* WITHOUT TIME ZONE: a "timestamp without time zone value should be taken or given as timezone local time"

The Spark / Databricks documentation discusses how Spark handles this:
https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps

* WITHOUT TIME ZONE: "These timestamps are not bound to any time zone, and are wall clock timestamps." (not UTC-normalized)
* WITH TIME ZONE: "does not affect the physical point in time that the timestamp represents, as that is fully represented by the UTC time instant given by the other timestamp components"

pandas, as discussed, has non-UTC-normalized WITHOUT TIME ZONE ("naive") timestamps and UTC-normalized WITH TIME ZONE timestamps.

If we were to change Arrow's WITHOUT TIME ZONE semantics to be interpreted as UTC-normalized, that would force all of these other systems (and more) to serialize their data as UTC-normalized (i.e. to call the equivalent of pandas's tz_localize function) when they convert to Arrow. This seems very harmful to me: it would make data from these systems not accurately representable in Arrow and unable to be round-tripped. Perhaps we can make a spreadsheet and look comprehensively at how many use cases would be disenfranchised by always requiring UTC normalization.

On Tue, Jun 15, 2021 at 3:16 PM Adam Hooper <a...@adamhooper.com> wrote:
>
> On Tue, Jun 15, 2021 at 1:19 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > Arrow's "Timestamp with Timezone" can have fields extracted
> > from it.
> >
>
> Sure, one *can* extract fields from timestamp+tz.
> But I don't feel timestamp+tz is *designed* for extracting fields:
>
> - Extracting fields from int64+tz is inefficient, because it bundles two
> steps: 1) convert to a datetime struct; and 2) return one field from the
> datetime struct. (If I want to extract Year, Month, and Day, is that three
> function calls that *each* convert to a datetime struct?)
> - Extracting fields from int64+tz is awkward, because it's not obvious
> which time zone is being used. (To extract fields in a custom time zone,
> must I 1) clone the column with a new time zone; and 2) call the function?)
>
> My understanding of "best practice" for extracting multiple fields from
> Arrow's timestamp columns is:
>
> 1. Convert from a timestamp column to date32 and/or time32/time64 columns
> in one pass (one of three operations, perhaps: timestamp=>date32,
> timestamp=>time64, or timestamp=>struct{date32,time64}).
> 2. Extract fields from those date32 and time64 columns.
>
> Only step 1 needs a time zone. In C, the analogue is localtime().
>
> We do step 1 at Workbench -- see converttimestamptodate
> <https://github.com/CJWorkbench/converttimestamptodate/blob/main/converttimestamptodate.py>
> for our implementation. We haven't had much demand for step 2, so we'll
> get to it later.
>
> I think of this "best practice" as a compromise:
>
> - date32+time64 aren't as time-efficient as C's struct tm, but together
> they use 12 bytes, whereas the C struct costs 50-100 bytes.
> - date32+time64 are 50% less space-efficient than int64, but they're
> intuitive and they save time.
>
> A small benchmark to back up that "save time" assertion, in Python:
>
> >>> import datetime, os, time, timeit
> >>> os.environ['TZ'] = 'America/Montreal'
> >>> time.tzset()
> >>> timestamp = time.time()
> >>> timeit.timeit(lambda: datetime.date.fromtimestamp(timestamp).year)
> 0.2955563920113491
> >>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline:
> timeit overhead + tuple construction
> 0.2509278700017603
>
> Most of the test is overhead, but certainly the timestamp=>date
> conversion takes time, and it's sane to try to minimize that overhead.
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com
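Adam's two-step pattern (one time-zone-aware conversion, then cheap field reads) can be sketched in plain Python. The split_timestamp helper, the epoch value, and the fixed UTC-4 offset below are all illustrative assumptions, not Arrow or Workbench API:

```python
from datetime import datetime, timedelta, timezone

def split_timestamp(epoch_seconds, tz):
    """Step 1: one time-zone-aware conversion, yielding a (date, time) pair,
    the moral equivalent of timestamp => struct{date32, time64}."""
    dt = datetime.fromtimestamp(epoch_seconds, tz)
    return dt.date(), dt.time()

tz = timezone(timedelta(hours=-4))         # illustrative fixed offset
d, t = split_timestamp(1_623_769_200, tz)  # 2021-06-15T15:00:00Z

# Step 2: field extraction needs no time zone and no further conversion,
# only attribute reads on the already-split date and time values.
print(d.year, d.month, d.day)
print(t.hour, t.minute)
```

The time zone is consumed once, in step 1; every field read in step 2 is then a plain lookup, which is the efficiency argument made above.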