If a system does not store a local datetime using the UTC-normalized
representation, and it puts that value in an Arrow timestamp column
without a time zone, then how should an Arrow compute function extract a
field?
For a concrete example, let's assume I have the number 172800000 in a
timestamp(ms) column with no time zone and the user has asked to extract
the day of month. I use 172800000 because it is the value in the parquet
docs example [1]. I thought I could assume that the source system had
normalized this value to UTC, and so I could run something like
`datetime.fromtimestamp(172800, tz=timezone.utc).day` and find out that
it is 3.

Perhaps, more concretely: there are many ways one could store a datetime
in a single number. The parquet docs mention two different ways, but
they are really the same thing: figure out the epoch timestamp for that
datetime in the UTC time zone (the instant at which a wall clock in UTC
would show the desired wall clock time). With this method the datetime
(1970, 1, 3, 0, 0) is stored as 0x000000000A4CB800 (172800000, assuming
ms).

So let's invent a third way. I could use the first 16 bits for the year,
the next 8 bits for the month, the next 8 bits for the day of month, the
next 8 bits for the hour, the next 8 bits for the minute, and the
remaining bits for the seconds. Using this method I would store
(1970, 1, 3, 0, 0) as 0x07B2010300000000.

If I understand your argument correctly, it is that Arrow is not going
to govern how these other systems encode a local datetime into an 8-byte
value, and so both of those are valid representations of
(1970, 1, 3, 0, 0). As a result, there would be no possible way to write
a uniform kernel for field extraction that would work in Arrow (see the
sketch at the end of this message). Am I understanding you correctly? Or
have I misinterpreted things again, as I've already done that several
times on this thread alone :)

[1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc

On Thu, Jun 17, 2021 at 8:59 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> To take a step back to focus on some concrete issues:
>
> Parquet has two timestamp types: with (UTC-normalized) / without time
> zone (non-UTC-normalized)
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L268
>
> The SQL standard (e.g. PostgreSQL) has two timestamp types:
> with/without time zone — in some SQL implementations each slot can
> have a different time zone
> https://www.postgresql.org/docs/9.1/datatype-datetime.html
> WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> or given as timezone local time"
>
> Spark / Databricks discusses how Spark handles this:
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> and are wall clock timestamps." — not UTC-normalized
> * WITH TIME ZONE: "does not affect the physical point in time that the
> timestamp represents, as that is fully represented by the UTC time
> instant given by the other timestamp components"
>
> pandas, as discussed, has non-UTC-normalized WITHOUT TIME ZONE "naive"
> timestamps and UTC-normalized WITH TIME ZONE timestamps.
>
> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> interpreted as UTC-normalized, that would force all of these other
> systems (and more) to serialize their data to be UTC-normalized (i.e.
> calling the equivalent of pandas's tz_localize function) when they
> convert to Arrow. This seems very harmful to me, and will make data
> from these systems not accurately representable in Arrow and unable
> to be round-tripped.
> Perhaps we can make a spreadsheet and look comprehensively at how many
> use cases would be disenfranchised by requiring UTC normalization
> always.
>
> On Tue, Jun 15, 2021 at 3:16 PM Adam Hooper <a...@adamhooper.com> wrote:
> >
> > On Tue, Jun 15, 2021 at 1:19 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > Arrow's "Timestamp with Timezone" can have fields extracted
> > > from it.
> >
> > Sure, one *can* extract fields from timestamp+tz. But I don't feel
> > timestamp+tz is *designed* for extracting fields:
> >
> > - Extracting fields from int64+tz is inefficient, because it bundles
> >   two steps: 1) convert to datetime struct; and 2) return one field
> >   from the datetime struct. (If I want to extract Year, Month, Day, is
> >   that three function calls that *each* convert to datetime struct?)
> > - Extracting fields from int64+tz is awkward, because it's not obvious
> >   which timezone is being used. (To extract fields in a custom
> >   timezone, must I 1) clone the column with a new timezone; and
> >   2) call the function?)
> >
> > My understanding of "best practice" for extracting multiple fields
> > using Arrow's timestamp columns is:
> >
> > 1. Convert from timestamp column to date32 and/or time32/time64
> >    columns in one pass (one of three operations, perhaps:
> >    timestamp=>date32, timestamp=>time64, or
> >    timestamp=>struct{date32,time64})
> > 2. Extract fields from those date32 and time64 columns.
> >
> > Only step 1 needs a timezone. In C, the analogue is localtime().
> >
> > We do step 1 at Workbench -- see converttimestamptodate
> > <https://github.com/CJWorkbench/converttimestamptodate/blob/main/converttimestamptodate.py>
> > for our implementation. We haven't had much demand for step 2, so
> > we'll get to it later.
> >
> > I think of this "best practice" as a compromise:
> >
> > - date32+time64 aren't as time-efficient as C's struct tm, but
> >   together they use 12 bytes whereas the C struct costs 50-100 bytes.
> > - date32+time64 are 50% less space-efficient than int64, but they're
> >   intuitive and they save time.
> >
> > A small benchmark to prove that "save time" assertion in Python:
> >
> > >>> import datetime, os, time, timeit
> > >>> os.environ['TZ'] = 'America/Montreal'
> > >>> time.tzset()
> > >>> timestamp = time.time()
> > >>> timeit.timeit(lambda: datetime.date.fromtimestamp(timestamp).year)
> > 0.2955563920113491
> > >>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline: timeit overhead + tuple construction
> > 0.2509278700017603
> >
> > Most of the test is overhead; but certainly the timestamp=>date
> > conversion takes time, and it's sane to try and minimize that
> > overhead.
> >
> > Enjoy life,
> > Adam
> >
> > --
> > Adam Hooper
> > +1-514-882-9694
> > http://adamhooper.com
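
P.S. To make the ambiguity I'm describing concrete, here is a minimal
sketch in plain Python (the names `encode_utc_epoch_ms`,
`encode_bit_packed`, and `extract_day` are all made up for this example;
`extract_day` stands in for a hypothetical Arrow kernel that assumes
UTC-normalized storage):

    from datetime import datetime, timezone

    WALL_CLOCK = (1970, 1, 3, 0, 0)  # year, month, day, hour, minute

    def encode_utc_epoch_ms(y, mo, d, h, mi):
        # The parquet-style representation: epoch milliseconds, treating
        # the wall clock reading as if it were UTC.
        return int(datetime(y, mo, d, h, mi, tzinfo=timezone.utc)
                   .timestamp() * 1000)

    def encode_bit_packed(y, mo, d, h, mi):
        # The invented third representation:
        # 16 bits year | 8 month | 8 day | 8 hour | 8 minute | 16 seconds
        return (y << 48) | (mo << 40) | (d << 32) | (h << 24) | (mi << 16)

    def extract_day(value_ms):
        # A "uniform" kernel that assumes the first representation.
        return datetime.fromtimestamp(value_ms / 1000, tz=timezone.utc).day

    a = encode_utc_epoch_ms(*WALL_CLOCK)  # 172800000 == 0x000000000A4CB800
    b = encode_bit_packed(*WALL_CLOCK)    # 0x07B2010300000000

    print(extract_day(a))  # 3 -- the kernel's assumption holds
    # extract_day(b) raises: interpreted as epoch milliseconds, the
    # bit-packed value is wildly out of range, even though it encodes
    # the same wall clock reading.

Both `a` and `b` encode the same local datetime, but only one of them is
legible to a kernel that assumes the UTC-normalized representation.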
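
Also, on Wes's tz_localize point: UTC-normalizing a naive timestamp
forces you to pick a zone, and the pick changes which instant gets
stored. A tiny pandas illustration (the choice of America/New_York is
arbitrary):

    import pandas as pd

    naive = pd.Timestamp("1970-01-03 00:00")  # wall clock only, no instant

    # Localizing requires choosing a zone, and the choice changes the
    # stored instant:
    utc = naive.tz_localize("UTC")
    ny = naive.tz_localize("America/New_York")

    print(utc.value)  # 172800000000000 (ns since the epoch)
    print(ny.value)   # 190800000000000 -- a different instant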
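
And Adam's two-step recipe above (convert once, then extract fields from
the converted values) looks roughly like this in plain Python, with
lists standing in for Arrow date32 columns; only step 1 touches a time
zone (UTC here, purely for illustration):

    from datetime import datetime, timezone

    timestamps = [172800.0, 136800.0]  # epoch seconds

    # Step 1: one conversion pass -- the only step that needs a time zone.
    dates = [datetime.fromtimestamp(t, tz=timezone.utc).date()
             for t in timestamps]

    # Step 2: field extraction is plain attribute access; no repeated
    # timestamp-to-datetime conversion per field.
    years = [d.year for d in dates]    # [1970, 1970]
    months = [d.month for d in dates]  # [1, 1]
    days = [d.day for d in dates]      # [3, 2]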