There are two (related) proposals under discussion, I think. I created a streamlined document for each. Feel free to add arguments for/against or to clarify. If no one has any better ideas I'll start a vote topic for each one in ~48 hours.
# Proposal: Clarify meaning of timestamp without time zone * https://docs.google.com/document/d/1wDAuxEDVo3YxZx20fGUGqQxi3aoss7TJ-TzOUjaoZk8/edit?usp=sharing # Proposal: Arrow should define how an “Instant” is stored * https://docs.google.com/document/d/1xEKRhs-GUSMwjMhgmQdnCNMXwZrA10226AcXRoP8g9E/edit?usp=sharing On Tue, Jun 22, 2021 at 6:01 AM Julian Hyde <jhyde.apa...@gmail.com> wrote: > > My proposal is that Arrow should support three different kinds of date-times: > zoneless, zoned, and instant. (Not necessarily with those names.) > > All three kinds occur frequently in the industry. > > Many systems only have two, and users of those systems have figured out how > to make do. (For example, you can implement an instant using a zoneless, and > vice versa, if you are careful.) But let’s suppose that Arrow has two of the > three, and needs to interoperate with a system that has a different two of > the three. Chaos ensues. > > Let’s just implement all three. > > Julian > > > > > > On Jun 22, 2021, at 8:46 AM, Wes McKinney <wesmck...@gmail.com> wrote: > > > > Let's see a streamlined document about what we are voting on — I have > > limited bandwidth to read through and synthesize the discussion myself > > and I am probably not the only one. It has always been my > > understanding to represent time elapsed from the UNIX epoch > > (1970-01-01 00:00:00 UTC) as a timestamp with tz=UTC (which I believe > > is what is being called an "instant"). > > > > On Tue, Jun 22, 2021 at 9:01 AM Adam Hooper <a...@adamhooper.com> wrote: > >> > >> Maybe Arrow should add a new type, "INSTANT"? > >> > >> Instant (seconds since the epoch) is the gold standard in storing moments > >> in time. All programming languages; all RDBMSs; Parquet ... everybody uses > >> this. > >> > >> I use Instants, too. I interpreted TIMESTAMP with no metadata to mean > >> Instant because I read the docs *assuming* Arrow stores Instant. 
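[Editor's note] Julian's three kinds of date-times map onto familiar constructs. A stdlib-only Python sketch (illustrative names, not Arrow APIs) of zoneless, zoned, and instant:

```python
from datetime import datetime, timezone, timedelta

# Zoneless ("naive"): a wall-clock reading with no zone attached.
zoneless = datetime(1970, 1, 2, 14, 0)

# Zoned: the same wall-clock reading pinned to a specific offset (+01:00 here).
zoned = datetime(1970, 1, 2, 14, 0, tzinfo=timezone(timedelta(hours=1)))

# Instant: an absolute point on the timeline, stored as an offset from the epoch.
instant_ms = int(zoned.timestamp() * 1000)
print(instant_ms)  # 133200000, i.e. 1970-01-02T13:00:00Z in milliseconds

# Note: calling .timestamp() on the zoneless value would silently assume the
# local system zone -- exactly the ambiguity this thread is about.
```

The zoned and instant forms are losslessly interconvertible; the zoneless form only becomes an instant once some zone is chosen for it.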
> >> > >> I know, I know, no vote can stop me from interpreting "timestamp without > >> timezone" however the heck I want. But it's painful for me to transition > >> from happy user to heretic. > >> > >> Voting to clarify that Arrow doesn't store Instants is voting to clarify > >> that Arrow *doesn't* do something extremely useful. It's voting for a > >> negative. That sounds painful! What if there were positives to vote for? An > >> "INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)? > >> A fiat that timezone=UTC means Instant, not ZonedDateTime? > >> > >> Enjoy life, > >> Adam > >> > >> On Mon, Jun 21, 2021 at 11:39 PM Weston Pace <weston.p...@gmail.com> wrote: > >> > >>> I agree that a vote would be a good idea. Do you want to start a > >>> dedicated vote thread? I can write one up too if you'd rather. > >>> > >>> -Weston > >>> > >>> On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <emkornfi...@gmail.com> > >>> wrote: > >>>> > >>>> I think comments on the doc are tailing off. Jorge's test cases I think > >>>> still need some more careful analysis but Weston has provided an > >>>> initial pass. > >>>> > >>>> The matter not resolved in the document is whether Timestamp with > >>> timezone > >>>> logically represents multi-field date and time (that does not represent a > >>>> specific instant) or whether it logically represents an instant (some > >>>> measurable offset from an epoch). Based on comments on the > >>> documentation, > >>>> the C++/Python implementations and the Java implementations (those that > >>>> have generally been considered "reference") both have evidence that the > >>>> former representation is what is intended (some links are in the > >>> document). > >>>> > >>>> We can probably continue to debate what is useful but it seems ultimately > >>>> we need to pick one or the other and clarify the specification.
Given > >>> how > >>>> the reference implementations currently work I think we should err on > >>> the > >>>> side of interpreting these values as date times. Ultimately, given the > >>>> contention here we will likely need to vote on this. > >>>> > >>>> More comments on the document or here are still useful in case we've > >>> missed > >>>> an interpretation or there are other facts to consider. > >>>> > >>>> Cheers, > >>>> Micah > >>>> > >>>> On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão < > >>>> jorgecarlei...@gmail.com> wrote: > >>>> > >>>>> Thank you everyone for participating so far; really important and > >>>>> useful discussion. > >>>>> > >>>>> I think of this discussion as a set of test cases over behavior: > >>>>> > >>>>> parameterization: > >>>>> * Timestamp(ms, None) > >>>>> * Timestamp(ms, "00:00") > >>>>> * Timestamp(ms, "01:00") > >>>>> > >>>>> Cases: > >>>>> * its string representation equals to > >>>>> * add a duration equals to > >>>>> * add an interval equals to > >>>>> * subtract a Timestamp(ms, None) equals to > >>>>> * subtract a Timestamp(ms, "01:00") equals to > >>>>> * subtract a Date32 equals to > >>>>> * subtract a Time32(ms) equals to > >>>>> * extract the day equals to > >>>>> * extract the timezone equals to > >>>>> * cast to Timestamp(ms, None) equals to > >>>>> * cast to Timestamp(ms, "01:00") equals to > >>>>> * write to parquet v2 equals to (physical value and logical type) > >>>>> > >>>>> In all cases, the result may either be valid or invalid. If valid, we > >>>>> would need a datatype and an actual value. > >>>>> I was hoping to be able to answer each of the above at the end of this > >>>>> discussion. > >>>>> > >>>>> I've suggested adding these in the google docs.
> >>>>> > >>>>> Best, > >>>>> Jorge > >>>>> > >>>>> On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield < > >>> emkornfi...@gmail.com> > >>>>> wrote: > >>>>>> > >>>>>> I've posted the examples above in > >>>>>> > >>>>> > >>> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing > >>>>>> because I think it would be better to collaborate there instead of > >>> linear > >>>>>> e-mail history and then bring the consensus back to the list. > >>>>>> > >>>>>> On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield < > >>> emkornfi...@gmail.com> > >>>>>> wrote: > >>>>>> > >>>>>>> I feel like we might still be talking past each other here or at > >>> least > >>>>> I > >>>>>>> don't understand the two sides of this. I'll try to expand > >>> Weston's > >>>>>>> example because I think it provides the best clarification. > >>>>>>> > >>>>>>> (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, > >>>>> assuming > >>>>>>> ms) for a timestamp column without timezone (always). This > >>>>> represents an > >>>>>>> offset from the unix epoch. This interpretation should not change > >>>>> based on > >>>>>>> the local system timezone. Extracting the hour field always > >>> yields 14 > >>>>>>> (extraction is done relative to UTC). > >>>>>>> > >>>>>>> The alternative here seems to be that we can encode (1970, 1, 2, > >>> 14, > >>>>> 0) in > >>>>>>> multiple different ways depending on what the current local system > >>> time > >>>>>>> is. As a note, I think ORC and Spark do this, and it leads to > >>>>>>> confusion/misinterpretation when trying to transfer data. > >>>>>>> > >>>>>>> If we then convert this column to a timestamp with a timezone in > >>> "UTC" > >>>>>>> timezone, extracting the hour field still yields 14. If the column > >>> is > >>>>>>> converted to timestamp with timezone PST, extracting an hour would > >>>>> yield 6 > >>>>>>> (assume PST = GMT-8). Through all of these changes the data > >>> bits do > >>>>> not > >>>>>>> change.
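[Editor's note] Micah's semantics can be checked with a stdlib-only Python sketch (a sketch of the proposed rule, not Arrow code). The millisecond offset for (1970, 1, 2, 14, 0) read against UTC is 136,800,000, and extraction behaves as described:

```python
from datetime import datetime, timezone, timedelta

# The column stores only this integer; "ms since the UNIX epoch" is the
# proposed fixed interpretation, independent of the local system zone.
stored_ms = 136_800_000  # (1970, 1, 2, 14, 0) read against UTC

# Extraction on a timestamp without timezone is done relative to UTC.
utc_dt = datetime.fromtimestamp(stored_ms / 1000, tz=timezone.utc)
print(utc_dt.hour)  # 14

# Converting the column to timezone PST (-08:00) changes only the
# extracted fields, never the stored bits.
pst = timezone(timedelta(hours=-8))
print(utc_dt.astimezone(pst).hour)  # 6
```

Throughout, `stored_ms` never changes; only the zone used for field extraction does.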
> >>>>>>> > >>>>>>> Display is not mentioned because I think the points about how a > >>> time > >>>>>>> is displayed are correct. Applications can choose what they feel makes > >>> sense > >>>>> to > >>>>>>> them (as long as they don't start automatically tacking on > >>> timezones to > >>>>>>> naive timestamps). My interpretation of the specification has been > >>>>> that display > >>>>>>> was kind of shorthand for field extraction. > >>>>>>> > >>>>>>> Could others on the thread confirm this is the issue up for debate? > >>>>> Are > >>>>>>> there subtleties/operations we need to consider? > >>>>>>> > >>>>>>> I also agree that we should document recommended conversion > >>> practices > >>>>> from > >>>>>>> other systems. > >>>>>>> > >>>>>>> -Micah > >>>>>>> > >>>>>>> > >>>>>>> So let's invent a third way. I could use > >>>>>>>> the first 16 bits for the year, the next 8 bits for the month, the > >>>>>>>> next 8 bits for the day of month, the next 8 bits for the hour, > >>> the > >>>>>>>> next 8 bits for the minute, and the remaining bits for the > >>> seconds. > >>>>>>>> Using this method I would store (1970, 1, 2, 14, 0) as > >>>>>>>> 0x07B201020E000000. > >>>>>>> > >>>>>>> Aside: with some small variation this is what ZetaSql uses [2] > >>>>>>> > >>>>>>> [1] > >>>>>>> > >>>>> > >>> https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark > >>>>>>> [2] > >>>>>>> > >>>>> > >>> https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62 > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> > >>>>> wrote: > >>>>>>> > >>>>>>>> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com > >>>> > >>>>> wrote: > >>>>>>>> > >>>>>>>>> > >>>>>>>>> The SQL standard (e.g.
PostgreSQL) has two timestamp types: > >>>>>>>>> with/without time zone — in some SQL implementations each slot > >>> can > >>>>>>>>> have a different time zone > >>>>>>>>> https://www.postgresql.org/docs/9.1/datatype-datetime.html > >>>>>>>>> WITHOUT TIME ZONE: "timestamp without time zone value should be > >>>>> taken > >>>>>>>>> or given as timezone local time" > >>>>>>>>> > >>>>>>>> > >>>>>>>> RDBMSs conflict (universally) with ANSI. > >>>>>>>> > >>>>>>>> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since > >>> the > >>>>> epoch. > >>>>>>>> It has no timezone. > >>>>>>>> > >>>>>>>> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int > >>>>> Instant > >>>>>>>> since the epoch. It has no timezone. > >>>>>>>> > >>>>>>>> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive > >>>>> datetime" in > >>>>>>>> *function*, but not in implementation: > >>>>>>>> > >>>>>>>> - MySQL DATETIME > >>>>>>>> < > >>>>>>>> > >>>>> > >>> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html > >>>>>>>>> > >>>>>>>> is weird: 1-bit sign, 17-bit year-month, 5-bit day, .... > >>>>>>>> - MSSQL > >>>>>>>> < > >>>>>>>> > >>>>> > >>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15 > >>>>>>>>> > >>>>>>>> uses 6, 7 or 8 bytes > >>>>>>>> - PostgreSQL stores an integer, but I think its epoch is still > >>>>>>>> different > >>>>>>>> < > >>>>>>>> > >>>>> > >>> https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h > >>>>>>>>> > >>>>>>>> (it > >>>>>>>> used to store doubles since 2000-01-01) > >>>>>>>> > >>>>>>>> ... so in general, moving datetimes from these systems into 64-bit > >>>>>>>> integers > >>>>>>>> is nontrivial and lossy.
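[Editor's note] The "third way" packed-struct encoding quoted above is easy to reproduce, which also shows why such packed values must never be confused with epoch offsets. Field widths follow the quoted layout (16-bit year, 8 bits each for month/day/hour/minute, 16 bits of seconds); the function name is illustrative:

```python
def pack_civil(year: int, month: int, day: int,
               hour: int, minute: int, second: int) -> int:
    """Pack a civil date-time into one 64-bit integer, high bits first.
    Illustrative only; real formats (MySQL DATETIME, ZetaSql civil_time)
    use different field widths."""
    return ((year << 48) | (month << 40) | (day << 32)
            | (hour << 24) | (minute << 16) | second)

# (1970, 1, 2, 14, 0) packs to 0x07B201020E000000, matching the quoted
# example -- numerically unrelated to the 136800000 ms epoch offset
# for the same moment read against UTC.
print(hex(pack_civil(1970, 1, 2, 14, 0, 0)))
```

Two systems exchanging bare 64-bit integers with different conventions (packed civil fields vs. epoch offset) would silently misread each other, which is the interoperability hazard being discussed.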
> >>>>>>>> > >>>>>>>> Spark / Databricks discusses how Spark handles this > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>> > >>> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps > >>>>>>>>> * WITHOUT TIME ZONE: "These timestamps are not bound to any time > >>>>> zone, > >>>>>>>>> and are wall clock timestamps." — not UTC-normalized > >>>>>>>>> * WITH TIME ZONE: "does not affect the physical point in time > >>> that > >>>>> the > >>>>>>>>> timestamp represents, as that is fully represented by the UTC > >>> time > >>>>>>>>> instant given by the other timestamp components" > >>>>>>>>> > >>>>>>>> > >>>>>>>> I don't use Spark, but I read that page twice. First reading, I > >>> got > >>>>> the > >>>>>>>> same thing out of it. But the second time I read it, I read the > >>>>> opposite! > >>>>>>>> > >>>>>>>> The key part is: "*Spark SQL defines the timestamp type as > >>> TIMESTAMP > >>>>> WITH > >>>>>>>> SESSION TIME ZONE*," -- in other words, Spark doesn't have a > >>> TIMESTAMP > >>>>>>>> WITH > >>>>>>>> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one > >>> Timestamp > >>>>>>>> type: > >>>>>>>> a 64-bit Instant since the epoch. (It also has a Date type.) > >>>>>>>> > >>>>>>>> If I'm reading correctly, this is exactly the same as PostgreSQL > >>>>> TIMESTAMP > >>>>>>>> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and > >>> transmit > >>>>>>>> timestamps as bare 64-bit integers since the epoch -- without > >>>>> timezone. > >>>>>>>> > >>>>>>>> > >>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be > >>>>>>>>> interpreted as UTC-normalized, that would force all of these > >>> other > >>>>>>>>> systems (and more) to serialize their data to be UTC-normalized > >>>>>>>> > >>>>>>>> > >>>>>>>> Are those systems' 64-bit integers interoperable in the first > >>> place? 
> >>>>>>>> > >>>>>>>> As I understand it, there's a ton of variance out there when > >>> encoding > >>>>>>>> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres > >>>>> encoding > >>>>>>>> is one of many. As I mentioned in another thread, programming > >>>>> languages > >>>>>>>> all > >>>>>>>> use structs. > >>>>>>>> > >>>>>>>> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE > >>> "naive" > >>>>>>>>> timestamps and UTC-normalized WITH TIME ZONE. > >>>>>>>>> > >>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be > >>>>>>>>> interpreted as UTC-normalized, that would force all of these > >>> other > >>>>>>>>> systems (and more) to serialize their data to be UTC-normalized > >>>>> (i.e. > >>>>>>>>> calling the equivalent of pandas's tz_localize function) when > >>> they > >>>>>>>>> convert to Arrow. > >>>>>>>> > >>>>>>>> > >>>>>>>> Alternatives: > >>>>>>>> > >>>>>>>> - int64 > >>>>>>>> - date32+time64 > >>>>>>>> - date32+time32 > >>>>>>>> > >>>>>>>> This seems very harmful to me, and will make data > >>>>>>>>> from these systems not accurately representable in Arrow and > >>> unable > >>>>> to > >>>>>>>>> be round-tripped. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Certainly nobody wants to go backwards. > >>>>>>>> > >>>>>>>> We need to clarify: how do we store these *common* types -- MySQL > >>>>>>>> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE > >>> -- in > >>>>>>>> Arrow? > >>>>>>>> > >>>>>>>> Secondarily, I think: how do we recommend users store *datetimes* > >>> in > >>>>>>>> Arrow? > >>>>>>>> (I'd expect this to be messier, since every system/language uses a > >>>>>>>> different byte structure.) > >>>>>>>> > >>>>>>>> Perhaps we can make a spreadsheet and look comprehensively at how > >>> many > >>>>>>>>> use cases would be disenfranchised by requiring UTC > >>> normalization > >>>>>>>>> always. > >>>>>>>> > >>>>>>>> > >>>>>>>> Hear, hear! 
> >>>>>>>> > >>>>>>>> Can we also poll people to find out how they're storing Instants > >>>>> today? > >>>>>>>> > >>>>>>>> Enjoy life, > >>>>>>>> Adam > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Adam Hooper > >>>>>>>> +1-514-882-9694 > >>>>>>>> http://adamhooper.com > >>>>>>>> > >>>>>>> > >>>>> > >>> > >> > >> > >> -- > >> Adam Hooper > >> +1-514-882-9694 > >> http://adamhooper.com >