Hi Arrow Friends,

On the Arrow call today, I brought up an issue of the TIME type not supporting 
24:00:00.

I'm doing compatibility checking against various databases (but mostly DuckDB) 
and this complicates the ability to round-trip data through IPC and back while 
preserving full fidelity at the edges of the time type.  If a database contains 
a value of `24:00:00` and that value cannot be represented in Arrow, 
serializing to IPC and reading it back will necessarily lose information. My 
bias here is simple: users should be able to move their data through Arrow 
without alteration.

I'm thinking we might want to revisit the decision to exclude 24:00:00. Because 
if databases have data at 24:00:00 (i.e. an event happened at a leap second) 
that information would be lost if they serialized it to Arrow then read the 
data back.  

24:00:00 is supported by these databases: PostgreSQL, DB2, Cockroach, Redshift, 
DuckDB, MySQL, MariaDB, TiDB, ClickHouse

It is not supported by: SQL Server, Snowflake, BigQuery, Firebird, H2. 

You can find my analysis and links to the docs here: 
https://rusty.today/blog/time-data-type-compatibility-across-databases/

There have been previous discussions (and votes) about this ~2021: 

"Clarifying interpretation of Time32/Time64 past 24 hours" - 
https://lists.apache.org/thread/ckms85lx5649z84jqhtowl7zvr71kxkr

And there was a vote to exclude 24:00:00.

I don’t have a strong opinion about leap seconds specifically, but I do care 
deeply about data fidelity and avoiding avoidable precision loss in IPC. Given 
the real-world database landscape, it may be time to reconsider whether 
excluding `24:00:00` best serves us all now.

Cheers,

Rusty
https://query.farm

Reply via email to