There are two (related) proposals under discussion, I think. I created a streamlined document for each. Feel free to add arguments for/against or to clarify. If no one has any better ideas I'll start a vote topic for each one in ~48 hours.
# Proposal: Clarify meaning of timestamp without time zone * https://docs.google.com/document/d/1wDAuxEDVo3YxZx20fGUGqQxi3aoss7TJ-TzOUjaoZk8/edit?usp=sharing # Proposal: Arrow should define how an “Instant” is stored * https://docs.google.com/document/d/1xEKRhs-GUSMwjMhgmQdnCNMXwZrA10226AcXRoP8g9E/edit?usp=sharing On Tue, Jun 22, 2021 at 6:01 AM Julian Hyde <jhyde.apa...@gmail.com> wrote: > > My proposal is that Arrow should support three different kinds of date-times: > zoneless, zoned, and instant. (Not necessarily with those names.) > > All three kinds occur frequently in the industry. > > Many systems only have two, and users of those systems have figured out how > to make do. (For example, you can implement an instant using a zoneless, and > vice versa, if you are careful.) But let’s suppose that Arrow has two of the > three, and needs to interoperate with a system that has a different two of > the three. Chaos ensues. > > Let’s just implement all three. > > Julian > > > > > > On Jun 22, 2021, at 8:46 AM, Wes McKinney <wesmck...@gmail.com> wrote: > > > > Let's see a streamlined document about what we are voting on — I have > > limited bandwidth to read through and synthesize the discussion myself > > and I am probably not the only one. It has always been my > > understanding to represent time elapsed from the UNIX epoch > > (1970-01-01 00:00:00 UTC) as a timestamp with tz=UTC (which I believe > > is what is being called an "instant"). > > > > On Tue, Jun 22, 2021 at 9:01 AM Adam Hooper <a...@adamhooper.com> wrote: > >> > >> Maybe Arrow should add a new type, "INSTANT"? > >> > >> Instant (seconds since the epoch) is the gold standard in storing moments > >> in time. All programming languages; all RDBMSs; Parquet ... everybody uses > >> this. > >> > >> I use Instants, too. I interpreted TIMESTAMP with no metadata to mean > >> Instant because I read the docs *assuming* Arrow stores Instant. 
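[Editor's note] Julian's three kinds of date-times map onto familiar constructs. A stdlib-only Python sketch (illustrative names, not Arrow APIs) of zoneless, zoned, and instant:

```python
from datetime import datetime, timezone, timedelta

# Zoneless ("naive"): a wall-clock reading with no zone attached.
zoneless = datetime(1970, 1, 2, 14, 0)

# Zoned: the same wall-clock reading pinned to a specific offset (+01:00 here).
zoned = datetime(1970, 1, 2, 14, 0, tzinfo=timezone(timedelta(hours=1)))

# Instant: an absolute point on the timeline, stored as an offset from the epoch.
instant_ms = int(zoned.timestamp() * 1000)
print(instant_ms)  # 133200000, i.e. 1970-01-02T13:00:00Z in milliseconds

# Note: calling .timestamp() on the zoneless value would silently assume the
# local system zone -- exactly the ambiguity this thread is about.
```

The zoned and instant forms are losslessly interconvertible; the zoneless form only becomes an instant once some zone is chosen for it.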
> >> > >> I know, I know, no vote can stop me from interpreting "timestamp without > >> timezone" however the heck I want. But it's painful for me to transition > >> from happy user to heretic. > >> > >> Voting to clarify that Arrow doesn't store Instants is voting to clarify > >> that Arrow *doesn't* do something extremely useful. It's voting for a > >> negative. That sounds painful! What if there were positives to vote for? An > >> "INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)? > >> A fiat that timezone=UTC means Instant, not ZonedDateTime? > >> > >> Enjoy life, > >> Adam > >> > >> On Mon, Jun 21, 2021 at 11:39 PM Weston Pace <weston.p...@gmail.com> wrote: > >> > >>> I agree that a vote would be a good idea. Do you want to start a > >>> dedicated vote thread? I can write one up too if you'd rather. > >>> > >>> -Weston > >>> > >>> On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <emkornfi...@gmail.com> > >>> wrote: > >>>> > >>>> I think comments on the doc are tailing off. Jorge's test cases I think > >>>> still need some more careful analysis but Weston has provided an > >>>> initial pass. > >>>> > >>>> The matter not resolved in the document is whether Timestamp with > >>> timezone > >>>> logically represents multi-field date and time (that does not represent a > >>>> specific instant) or whether it logically represents an instant (some > >>>> measurable offset from an epoch). Based on comments on the > >>> documentation, > >>>> the C++/Python implementations and the Java implementations (those that > >>>> have generally been considered "reference") both have evidence that the > >>>> former representation is what is intended (some links are in the > >>> document). > >>>> > >>>> We can probably continue to debate what is useful but it seems ultimately > >>>> we need to pick one or the other and clarify the specification.
Given > >>> how > >>>> the reference implementations currently work I think we should err on > >>> the > >>>> side of interpreting these values as date times. Ultimately, given the > >>>> contention here we will likely need to vote on this. > >>>> > >>>> More comments on the document or here are still useful in case we've > >>> missed > >>>> an interpretation or there are other facts to consider. > >>>> > >>>> Cheers, > >>>> Micah > >>>> > >>>> On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão < > >>>> jorgecarlei...@gmail.com> wrote: > >>>> > >>>>> Thank you everyone for participating so far; really important and > >>>>> useful discussion. > >>>>> > >>>>> I think of this discussion as a set of test cases over behavior: > >>>>> > >>>>> parameterization: > >>>>> * Timestamp(ms, None) > >>>>> * Timestamp(ms, "00:00") > >>>>> * Timestamp(ms, "01:00") > >>>>> > >>>>> Cases: > >>>>> * its string representation equals to > >>>>> * add a duration equals to > >>>>> * add an interval equals to > >>>>> * subtract a Timestamp(ms, None) equals to > >>>>> * subtract a Timestamp(ms, "01:00") equals to > >>>>> * subtract a Date32 equals to > >>>>> * subtract a Time32(ms) equals to > >>>>> * extract the day equals to > >>>>> * extract the timezone equals to > >>>>> * cast to Timestamp(ms, None) equals to > >>>>> * cast to Timestamp(ms, "01:00") equals to > >>>>> * write to parquet v2 equals to (physical value and logical type) > >>>>> > >>>>> In all cases, the result may either be valid or invalid. If valid, we > >>>>> would need a datatype and an actual value. > >>>>> I was hoping to be able to answer each of the above at the end of this > >>>>> discussion. > >>>>> > >>>>> I've suggested adding these in the google docs.
> >>>>> > >>>>> Best, > >>>>> Jorge > >>>>> > >>>>> On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield < > >>> emkornfi...@gmail.com> > >>>>> wrote: > >>>>>> > >>>>>> I've posted the examples above in > >>>>>> > >>>>> > >>> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing > >>>>>> because I think it would be better to collaborate there instead of > >>> linear > >>>>>> e-mail history and then bring the consensus back to the list. > >>>>>> > >>>>>> On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield < > >>> emkornfi...@gmail.com> > >>>>>> wrote: > >>>>>> > >>>>>>> I feel like we might still be talking past each other here or at > >>> least > >>>>> I > >>>>>>> don't understand the two sides of this. I'll try to expand > >>> Weston's > >>>>>>> example because I think it provides the best clarification. > >>>>>>> > >>>>>>> (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, > >>>>> assuming > >>>>>>> ms) for a timestamp column without timezone (always). This > >>>>> represents an > >>>>>>> offset from the unix epoch. This interpretation should not change > >>>>> based on > >>>>>>> the local system timezone. Extracting the hour field always > >>> yields 14 > >>>>>>> (extraction is done relative to UTC). > >>>>>>> > >>>>>>> The alternative here seems to be that we can encode (1970, 1, 2, > >>> 14, > >>>>> 0) in > >>>>>>> multiple different ways depending on what the current local system > >>> time > >>>>>>> is. As a note, I think ORC and Spark do this, and it leads to > >>>>>>> confusion/misinterpretation when trying to transfer data. > >>>>>>> > >>>>>>> If we then convert this column to a timestamp with a timezone in > >>> "UTC" > >>>>>>> timezone, extracting the hour field still yields 14. If the column > >>> is > >>>>>>> converted to timestamp with timezone PST, extracting an hour would > >>>>> yield 6 > >>>>>>> (assume PST = GMT-8). Through all of these changes the data > >>> bits do > >>>>> not > >>>>>>> change.
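[Editor's note] Micah's semantics can be checked with a stdlib-only Python sketch (a sketch of the proposed rule, not Arrow code). The millisecond offset for (1970, 1, 2, 14, 0) read against UTC is 136,800,000, and extraction behaves as described:

```python
from datetime import datetime, timezone, timedelta

# The column stores only this integer; "ms since the UNIX epoch" is the
# proposed fixed interpretation, independent of the local system zone.
stored_ms = 136_800_000  # (1970, 1, 2, 14, 0) read against UTC

# Extraction on a timestamp without timezone is done relative to UTC.
utc_dt = datetime.fromtimestamp(stored_ms / 1000, tz=timezone.utc)
print(utc_dt.hour)  # 14

# Converting the column to timezone PST (-08:00) changes only the
# extracted fields, never the stored bits.
pst = timezone(timedelta(hours=-8))
print(utc_dt.astimezone(pst).hour)  # 6
```

Throughout, `stored_ms` never changes; only the zone used for field extraction does.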
> >>>>>>> > >>>>>>> Display is not mentioned because I think the points about how a > >>> time > >>>>>>> is displayed are correct. Applications can choose what they feel makes > >>> sense > >>>>> to > >>>>>>> them (as long as they don't start automatically tacking on > >>> timezones to > >>>>>>> naive timestamps). My interpretation of the specification has been > >>>>> that display > >>>>>>> was kind of shorthand for field extraction. > >>>>>>> > >>>>>>> Could others on the thread confirm this is the issue up for debate? > >>>>> Are > >>>>>>> there subtleties/operations we need to consider? > >>>>>>> > >>>>>>> I also agree that we should document recommended conversion > >>> practices > >>>>> from > >>>>>>> other systems. > >>>>>>> > >>>>>>> -Micah > >>>>>>> > >>>>>>> > >>>>>>> So let's invent a third way. I could use > >>>>>>>> the first 16 bits for the year, the next 8 bits for the month, the > >>>>>>>> next 8 bits for the day of month, the next 8 bits for the hour, > >>> the > >>>>>>>> next 8 bits for the minute, and the remaining bits for the > >>> seconds. > >>>>>>>> Using this method I would store (1970, 1, 2, 14, 0) as > >>>>>>>> 0x07B201020E000000. > >>>>>>> > >>>>>>> Aside: with some small variation this is what ZetaSql uses [2] > >>>>>>> > >>>>>>> [1] > >>>>>>> > >>>>> > >>> https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark > >>>>>>> [2] > >>>>>>> > >>>>> > >>> https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62 > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> > >>>>> wrote: > >>>>>>> > >>>>>>>> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com > >>>> > >>>>> wrote: > >>>>>>>> > >>>>>>>>> > >>>>>>>>> The SQL standard (e.g.
PostgreSQL) has two timestamp types: > >>>>>>>>> with/without time zone — in some SQL implementations each slot > >>> can > >>>>>>>>> have a different time zone > >>>>>>>>> https://www.postgresql.org/docs/9.1/datatype-datetime.html > >>>>>>>>> WITHOUT TIME ZONE: "timestamp without time zone value should be > >>>>> taken > >>>>>>>>> or given as timezone local time" > >>>>>>>>> > >>>>>>>> > >>>>>>>> RDBMSs conflict (universally) with ANSI. > >>>>>>>> > >>>>>>>> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since > >>> the > >>>>> epoch. > >>>>>>>> It has no timezone. > >>>>>>>> > >>>>>>>> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int > >>>>> Instant > >>>>>>>> since the epoch. It has no timezone. > >>>>>>>> > >>>>>>>> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive > >>>>> datetime" in > >>>>>>>> *function*, but not in implementation: > >>>>>>>> > >>>>>>>> - MySQL DATETIME > >>>>>>>> < > >>>>>>>> > >>>>> > >>> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html > >>>>>>>>> > >>>>>>>> is weird: 1-bit sign, 17-bit year-month, 5-bit day, .... > >>>>>>>> - MSSQL > >>>>>>>> < > >>>>>>>> > >>>>> > >>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15 > >>>>>>>>> > >>>>>>>> uses 6, 7 or 8 bytes > >>>>>>>> - PostgreSQL stores an integer, but I think its epoch is still > >>>>>>>> different > >>>>>>>> < > >>>>>>>> > >>>>> > >>> https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h > >>>>>>>>> > >>>>>>>> (it > >>>>>>>> used to store doubles since 2000-01-01) > >>>>>>>> > >>>>>>>> ... so in general, moving datetimes from these systems into 64-bit > >>>>>>>> integers > >>>>>>>> is nontrivial and lossy.
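[Editor's note] The "third way" packed-struct encoding quoted above is easy to reproduce, which also shows why such packed values must never be confused with epoch offsets. Field widths follow the quoted layout (16-bit year, 8 bits each for month/day/hour/minute, 16 bits of seconds); the function name is illustrative:

```python
def pack_civil(year: int, month: int, day: int,
               hour: int, minute: int, second: int) -> int:
    """Pack a civil date-time into one 64-bit integer, high bits first.
    Illustrative only; real formats (MySQL DATETIME, ZetaSql civil_time)
    use different field widths."""
    return ((year << 48) | (month << 40) | (day << 32)
            | (hour << 24) | (minute << 16) | second)

# (1970, 1, 2, 14, 0) packs to 0x07B201020E000000, matching the quoted
# example -- numerically unrelated to the 136800000 ms epoch offset
# for the same moment read against UTC.
print(hex(pack_civil(1970, 1, 2, 14, 0, 0)))
```

Two systems exchanging bare 64-bit integers with different conventions (packed civil fields vs. epoch offset) would silently misread each other, which is the interoperability hazard being discussed.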
> >>>>>>>> > >>>>>>>> Spark / Databricks discusses how Spark handles this > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>> > >>> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps > >>>>>>>>> * WITHOUT TIME ZONE: "These timestamps are not bound to any time > >>>>> zone, > >>>>>>>>> and are wall clock timestamps." — not UTC-normalized > >>>>>>>>> * WITH TIME ZONE: "does not affect the physical point in time > >>> that > >>>>> the > >>>>>>>>> timestamp represents, as that is fully represented by the UTC > >>> time > >>>>>>>>> instant given by the other timestamp components" > >>>>>>>>> > >>>>>>>> > >>>>>>>> I don't use Spark, but I read that page twice. First reading, I > >>> got > >>>>> the > >>>>>>>> same thing out of it. But the second time I read it, I read the > >>>>> opposite! > >>>>>>>> > >>>>>>>> The key part is: "*Spark SQL defines the timestamp type as > >>> TIMESTAMP > >>>>> WITH > >>>>>>>> SESSION TIME ZONE*," -- in other words, Spark doesn't have a > >>> TIMESTAMP > >>>>>>>> WITH > >>>>>>>> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one > >>> Timestamp > >>>>>>>> type: > >>>>>>>> a 64-bit Instant since the epoch. (It also has a Date type.) > >>>>>>>> > >>>>>>>> If I'm reading correctly, this is exactly the same as PostgreSQL > >>>>> TIMESTAMP > >>>>>>>> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and > >>> transmit > >>>>>>>> timestamps as bare 64-bit integers since the epoch -- without > >>>>> timezone. > >>>>>>>> > >>>>>>>> > >>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be > >>>>>>>>> interpreted as UTC-normalized, that would force all of these > >>> other > >>>>>>>>> systems (and more) to serialize their data to be UTC-normalized > >>>>>>>> > >>>>>>>> > >>>>>>>> Are those systems' 64-bit integers interoperable in the first > >>> place? 
> >>>>>>>> > >>>>>>>> As I understand it, there's a ton of variance out there when > >>> encoding > >>>>>>>> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres > >>>>> encoding > >>>>>>>> is one of many. As I mentioned in another thread, programming > >>>>> languages > >>>>>>>> all > >>>>>>>> use structs. > >>>>>>>> > >>>>>>>> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE > >>> "naive" > >>>>>>>>> timestamps and UTC-normalized WITH TIME ZONE. > >>>>>>>>> > >>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be > >>>>>>>>> interpreted as UTC-normalized, that would force all of these > >>> other > >>>>>>>>> systems (and more) to serialize their data to be UTC-normalized > >>>>> (i.e. > >>>>>>>>> calling the equivalent of pandas's tz_localize function) when > >>> they > >>>>>>>>> convert to Arrow. > >>>>>>>> > >>>>>>>> > >>>>>>>> Alternatives: > >>>>>>>> > >>>>>>>> - int64 > >>>>>>>> - date32+time64 > >>>>>>>> - date32+time32 > >>>>>>>> > >>>>>>>> This seems very harmful to me, and will make data > >>>>>>>>> from these systems not accurately representable in Arrow and > >>> unable > >>>>> to > >>>>>>>>> be round-tripped. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Certainly nobody wants to go backwards. > >>>>>>>> > >>>>>>>> We need to clarify: how do we store these *common* types -- MySQL > >>>>>>>> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE > >>> -- in > >>>>>>>> Arrow? > >>>>>>>> > >>>>>>>> Secondarily, I think: how do we recommend users store *datetimes* > >>> in > >>>>>>>> Arrow? > >>>>>>>> (I'd expect this to be messier, since every system/language uses a > >>>>>>>> different byte structure.) > >>>>>>>> > >>>>>>>> Perhaps we can make a spreadsheet and look comprehensively at how > >>> many > >>>>>>>>> use cases would be disenfranchised by requiring UTC > >>> normalization > >>>>>>>>> always. > >>>>>>>> > >>>>>>>> > >>>>>>>> Hear, hear! 
> >>>>>>>> > >>>>>>>> Can we also poll people to find out how they're storing Instants > >>>>> today? > >>>>>>>> > >>>>>>>> Enjoy life, > >>>>>>>> Adam > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Adam Hooper > >>>>>>>> +1-514-882-9694 > >>>>>>>> http://adamhooper.com > >>>>>>>> > >>>>>>> > >>>>> > >>> > >> > >> > >> -- > >> Adam Hooper > >> +1-514-882-9694 > >> http://adamhooper.com >