None of PostgreSQL, SQL Server, or Oracle seems to support leap seconds at all.

So perhaps the Arrow format should not support them either. However, at some IO boundaries (such as when converting from CSV or JSON), we may want to "coerce" leap seconds (which probably means turning a seconds field of 60 into 59, and turning 24:00:00 into 23:59:59).

(this is also what Python does:
https://github.com/python/cpython/blob/main/Modules/_datetimemodule.c#L4978-L4984)
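
To make the coercion idea concrete, here is a minimal sketch in plain Python. The helper name `coerce_leap_second` and the "HH:MM:SS" input format are assumptions for illustration only; this is not an existing Arrow or CPython function.

```python
def coerce_leap_second(text):
    """Parse 'HH:MM:SS' and return seconds since midnight, coercing leap seconds.

    Hypothetical helper: the name and the input format are assumptions.
    """
    hour, minute, second = (int(part) for part in text.split(":"))
    if (hour, minute, second) == (24, 0, 0):
        # An end-of-day value written as 24:00:00 is folded back to 23:59:59.
        hour, minute, second = 23, 59, 59
    elif second == 60:
        # A seconds field of 60 is turned into 59.
        second = 59
    if not (0 <= hour < 24 and 0 <= minute < 60 and 0 <= second < 60):
        raise ValueError(f"invalid time of day: {text!r}")
    return hour * 3600 + minute * 60 + second
```

With this sketch, both "23:59:60" and "24:00:00" come out as the last regular second of the day, so the coercion loses information by design.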


On 16/08/2021 at 23:13, Antoine Pitrou wrote:

PS: we also need to check what databases do and allow.


On 16/08/2021 at 23:12, Antoine Pitrou wrote:

POSIX allows for a single leap second:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/time.h.html

The Windows API does not seem to know about leap seconds:
https://docs.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-systemtime

The standard Python type `datetime.time` does not allow for leap seconds
(*):
https://docs.python.org/3/library/datetime.html#available-types

((*) """An idealized time, independent of any particular day, assuming
that every day has exactly 24*60*60 seconds. (There is no notion of
“leap seconds” here.) """)
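
A quick way to check this behaviour, as a small sketch (the helper name `accepts_second` is made up for illustration; `datetime.time` itself is the standard type):

```python
import datetime


def accepts_second(second):
    """Return True if datetime.time accepts this value for the seconds field."""
    try:
        datetime.time(23, 59, second)
    except ValueError:
        # datetime.time requires 0 <= second < 60, so 60 is rejected.
        return False
    return True
```

Here `accepts_second(59)` is true while `accepts_second(60)` is false, since the constructor raises `ValueError` for a seconds field of 60.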


I think it would be reasonable to allow for a single leap second in
Arrow, meaning the interval [0, 86400] for a time expressed in seconds.

It would also put the nail in the coffin of Weston's approach (B) (you
can't represent leap seconds if you mandate that values are interpreted
modulo 86400, and interpreting them modulo 86401 would be utterly weird).
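
The conflict can be seen with a few lines of arithmetic. This is only a sketch for a time stored as seconds since midnight; the names below are chosen for illustration.

```python
SECONDS_PER_DAY = 86400


def wrap_modulo_day(seconds):
    """Interpret a time value modulo one UTC day, as approach (B) would."""
    return seconds % SECONDS_PER_DAY


# 23:59:60, the leap second, expressed as seconds since midnight.
LEAP_SECOND = 23 * 3600 + 59 * 60 + 60
```

`LEAP_SECOND` equals 86400, and `wrap_modulo_day(LEAP_SECOND)` gives 0, so under the modulo interpretation the leap second becomes indistinguishable from midnight.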

Regards

Antoine.


On 16/08/2021 at 22:23, Neal Richardson wrote:
At the risk of opening a can of worms, isn't it possible that a time could
exceed 24 hours? Like, when there are leap seconds added?

Some experiments inspired by an SO post[1] led me to question the meaning
of time.

Looks like the Arrow mailing list is taking a philosophical turn :)

Neal

On Mon, Aug 16, 2021 at 3:05 PM Antoine Pitrou <anto...@python.org> wrote:


On 16/08/2021 at 20:52, Weston Pace wrote:
Some experiments inspired by an SO post[1] led me to question the
meaning of time.  The main question is **what happens when the value
exceeds 24 hours?**

     A) One potential interpretation is that these are invalid, but neither
the C++ implementation nor pyarrow rejects them today.  Nor do they
correct them.
     B) An alternative interpretation is to modulo by UTC days (e.g., if
seconds, 86400) and use the resulting value.

The (B) approach makes conversion from timestamp -> time trivial (just
a metadata change).  I think this is the correct, and preferred,
interpretation.  However, it would require all implementations to
interpret time in this way.  With that in mind, if we think this is
the correct approach, I'd like to clean up the docs.

(B) doesn't make sense at all to me.  Really, (A) is the only reasonable
interpretation.

We don't check data at IO boundaries by default, since that would be
expensive (for example, we don't check for valid UTF8).  However, see
https://issues.apache.org/jira/browse/ARROW-10924 for explicit temporal
data validation in C++.
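
For illustration, an explicit opt-in check under interpretation (A) could look roughly like the sketch below, for times stored as seconds since midnight. It is plain Python with a hypothetical helper name and flag; it is not the Arrow C++ or pyarrow validation API.

```python
SECONDS_PER_DAY = 86400


def find_invalid_times(values, allow_leap_second=False):
    """Return the indices of values outside the valid time-of-day range.

    Hypothetical helper; None stands for a null entry and is skipped.
    """
    # Without leap seconds the last valid value is 86399; allowing a single
    # leap second extends the range to 86400 inclusive.
    upper = SECONDS_PER_DAY if allow_leap_second else SECONDS_PER_DAY - 1
    return [
        index
        for index, value in enumerate(values)
        if value is not None and not 0 <= value <= upper
    ]
```

For example, `find_invalid_times([0, 86399, 86400, -1, None, 90000])` reports indices 2, 3 and 5, and only 3 and 5 when `allow_leap_second=True`.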

Regards

Antoine.

