Re: [Format][Important] Needed clarification of timezone-less timestamps

Joris Van den Bossche Tue, 15 Jun 2021 00:31:45 -0700

Some inline answers to Weston's email below:

On Tue, 15 Jun 2021 at 07:34, Weston Pace <weston.p...@gmail.com> wrote:
> ...
> Let's pretend two astronomers observe a meteoroid impact on the moon.
> We are talking about two different ways they can record the time.  The
> first method, universal time, is done by recording the seconds since
> the epoch.  The second, wall clock time, is done by writing down the
> time seen on a clock (and nearby calendar).
>
> In both cases we do not know the full picture without the time zone
> information.  If we have two universal times (but no time zones) we
> can say whether the two astronomers witnessed the same event (assuming
> the impact site is equal) but we can't say whether they saw it at the
> same time of day (e.g. whether the two astronomers had both just
> finished dinner).


That's the reason we have a TIMESTAMP WITH TIME ZONE type, with which
you can have this full picture.
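
As a rough illustration of that "full picture" (a sketch using only
Python's standard library, not Arrow; the epoch value and the two fixed
UTC offsets are made up for the two astronomers): one instant plus a
time zone gives both the universal time and the local time of day.

```python
import datetime

# One shared instant (the impact), as seconds since the UNIX epoch.
# Hypothetical value, chosen only for illustration.
impact_epoch = 1623742305

# Hypothetical fixed offsets for the two observers.
tz_a = datetime.timezone(datetime.timedelta(hours=-7))
tz_b = datetime.timezone(datetime.timedelta(hours=2))

seen_a = datetime.datetime.fromtimestamp(impact_epoch, tz_a)
seen_b = datetime.datetime.fromtimestamp(impact_epoch, tz_b)

# Same event: aware datetimes compare by instant, so these are equal.
same_event = seen_a == seen_b

# Different time of day: the local hours differ by the 9 hour offset gap.
hour_gap = (seen_b.hour - seen_a.hour) % 24
```

With only the epoch value you can answer "same event?", and with the
offset attached you can also answer "same time of day?".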

> ...
> Rather than store wall clock time as a string (which is inefficient)
> Arrow stores wall clock time as the epoch timestamp at the point a
> wall clock in the UTC time zone would display the given time.  In
> other words, converting datetime.datetime.now to an Arrow timestamp
> does NOT give the current UNIX epoch.  The value that is stored is
> different for every time zone.  Or to put it yet another way.  The
> output of the following program...
>
> import pyarrow as pa
> import datetime
> pa.array([datetime.datetime.strptime('Jun 28 2018 7:40AM',
>          '%b %d %Y %I:%M%p')]).cast(pa.int64()).to_pylist()[0]
>
> ...will be identical on every machine.  But the output of...
>
> import pyarrow as pa
> import datetime
> pa.array([datetime.datetime.now()]).cast(pa.int64()).to_pylist()[0]
>
> ...will depend on the system time zone (ostensibly because the output
> of datetime.datetime.now() depends on the system time zone).

What you describe here is the behaviour of Python's datetime module,
not of Arrow. It's datetime.datetime.now() that is dependent on the
system time zone, but from Arrow's perspective, it just gets a naive
datetime in both cases, and handles those consistently.
So it's the responsibility of the user to decide whether they are OK
with the behaviour of datetime.datetime.now().
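
To make that split of responsibilities concrete, here is a minimal
sketch in plain Python (no pyarrow involved). The helper
naive_to_epoch_seconds is a hypothetical name, and it only mimics the
encoding Weston describes (wall clock fields read as if they were UTC);
it is not pyarrow's actual conversion code.

```python
import calendar
import datetime


def naive_to_epoch_seconds(dt):
    # Read the naive wall clock fields as if they were UTC. The system
    # time zone is never consulted here.
    return calendar.timegm(dt.timetuple())


# A fixed naive datetime gives the same value on every machine.
fixed = datetime.datetime.strptime('Jun 28 2018 7:40AM', '%b %d %Y %I:%M%p')
fixed_value = naive_to_epoch_seconds(fixed)

# now() already depends on the system time zone before any conversion
# happens; the conversion applied to it is the same function as above.
now_value = naive_to_epoch_seconds(datetime.datetime.now())

# Decoding the stored value as UTC and dropping the tzinfo recovers the
# original wall clock fields unchanged.
round_trip = datetime.datetime.fromtimestamp(
    fixed_value, datetime.timezone.utc).replace(tzinfo=None)
```

The only machine-dependent step is the call to datetime.datetime.now()
itself, which happens before the conversion sees the value.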

> ---
>
> So given my previous concrete example I said...
>
> > For each observation they record the unix timestamp (or maybe
> > they build up an instance of datetime objects created with
> > datetime.datetime.now())
>
> These two methods would actually yield different results.  If they
> created a pa.array([ts1, ts2], type=pa.timestamp('s')) with unix
> timestamps recorded at the time of the event then they would get the
> wrong histogram.
>
> If they created a pa.array([dt1, dt2], type=pa.timestamp('s')) with
> datetime.datetime objects created with datetime.datetime.now at the
> time of the event then they would get the correct histogram.

And is it a useful application that they can get the correct histogram
by using naive timestamps? It's probably debatable whether this is
"best practice", or whether it should rather be recommended to use
timestamps with timezones to get the same effect. But IMO it is not up
to Arrow to be opinionated in this "how should I use timezones"
debate; its job is to enable the widespread behaviour/usage patterns
of downstream libraries.

(But I also don't fully understand your point here: your "they would
get the correct histogram" seems to be a positive statement about
tz-naive timestamps, while your email starts with a +1 on Antoine's
proposal which, as far as I understand it, says that timestamps
without a time zone are useless / should be interpreted as UTC
instead, which would make the scenario you describe above impossible.)

> >
> > >  TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
> > > field is not set. We have stated that we want systems to use
> > > system-locale-independent choices for functions that act on this data
> > > (like stringification or field extraction)
> >
> > This is indeed a rehash of an earlier discussion where I agreed with
> > you but I think I understand the subtleties a bit more and now I
> > disagree, particularly on field extraction.  Field extraction can be
> > done on a naive "datetime" without assuming UTC which I think makes it
> > safer for Python.  Field extraction cannot be done on a naive
> > "timestamp" without assuming UTC.
> >
> > # Stringification
> >
> > I think we can get away with stringification.  It seems like the
> > consensus is to always output UTC format.  I will point out that
> > pyarrow does not do that today.  Currently in pyarrow I get
> >
> > >>> pa.array([datetime.datetime.now()])
> > <pyarrow.lib.TimestampArray object at 0x7f8ae865d520>
> > [
> >   2021-06-14 17:30:52.260044  # Local time
> > ]

Pyarrow displays the tz-naive timestamp "as is" (as described in the
spec), so I think the above behaviour is correct. You created a naive
datetime representing your local time with datetime.datetime.now(),
and pyarrow preserves that information on the conversion, and displays
the data as is. It will give the same string representation as
printing the datetime.datetime object, and it will preserve the fields
of the datetime.datetime object:

>>> import datetime
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> dt = datetime.datetime.now()
>>> dt
datetime.datetime(2021, 6, 15, 9, 18, 48, 108988)
>>> print(dt)
2021-06-15 09:18:48.108988

>>> arr = pa.array([dt])
>>> arr
<pyarrow.lib.TimestampArray object at 0x7ffa35459d60>
[
  2021-06-15 09:18:48.108988
]
>>> pc.hour(arr)
<pyarrow.lib.Int64Array object at 0x7ff9ecc7c340>
[
  9
]

> >
> > # Field extraction
> >
> > Here is a concrete example demonstrating the problems of field
> > extraction.  Consider a user that runs an experiment over several
> > weeks.  For each observation they record the unix timestamp (or maybe
> > they build up an instance of datetime objects created with
> > datetime.datetime.now()).  Then, using Arrow as a backend for
> > analysis, they create a histogram to show events by weekday.   If
> > Arrow is assuming UTC then the histogram is going to have the wrong
> > days of the week (unless the user happens to be in UTC).
> >
> > Simple queries like "Give me all events that happened on Tuesday" or
> > "Group rows by year" will not necessarily work on naive columns in the
> > way that a user expects (and yet these only require field extraction).

If Arrow doesn't assume UTC for timestamps without a time zone (which
is the current behaviour), then those queries will do what a user
expects: they return the field that matches the local time. See the
"pc.hour(..)" field extraction from my full example above:

>>> pc.hour(arr)
<pyarrow.lib.Int64Array object at 0x7ff9ecc7c340>
[
  9
]

This currently works on master (and `arr` is a timestamp without
timezone), and IMO gives the expected behaviour for the user.
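
To show why no time zone assumption is needed for this, here is a rough
sketch of field extraction as plain arithmetic on the stored value. The
helpers hour_of and weekday_of are hypothetical names and this is not
how pyarrow implements pc.hour; it only assumes the value is the wall
clock time encoded as seconds since 1970-01-01 00:00:00.

```python
import datetime

SECONDS_PER_DAY = 86400


def hour_of(value):
    # Seconds into the current day, divided into whole hours.
    return (value % SECONDS_PER_DAY) // 3600


def weekday_of(value):
    # 1970-01-01 was a Thursday, i.e. weekday 3 when Monday is 0.
    return (value // SECONDS_PER_DAY + 3) % 7


# The wall clock time from the example above (microseconds dropped).
wall = datetime.datetime(2021, 6, 15, 9, 18, 48)

# Encode the wall clock fields relative to a naive epoch; no time zone
# is involved on either side of the subtraction.
stored = int((wall - datetime.datetime(1970, 1, 1)).total_seconds())

hour = hour_of(stored)        # matches wall.hour
weekday = weekday_of(stored)  # matches wall.weekday()
```

The extracted fields are those of the recorded wall clock time, which
is what a "group by weekday" histogram on local observations needs.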

> >
> > So, my particular resolution (what I am arguing for), is that arrow
> > libraries that perform field extraction should return an error when
> > presented with a timestamp that does not have a timezone.
> >
