jkosh44 commented on issue #16285: URL: https://github.com/apache/datafusion/issues/16285#issuecomment-2950146627
The query that fails looks like ```sql create table foo (val int, ts1 timestamp, ts2 timestamp, i interval) ... SELECT val, ts1 - ts2 FROM foo ORDER BY ts2 - ts1; ``` Subtracting two timestamps returns a [duration type](https://arrow.apache.org/docs/format/Columnar.html#data-types). ```sql > SELECT arrow_typeof(ts1 - ts2) FROM foo LIMIT 1; +---------------------------------+ | arrow_typeof(foo.ts1 - foo.ts2) | +---------------------------------+ | Duration(Nanosecond) | +---------------------------------+ 1 row(s) fetched. Elapsed 0.005 seconds. ``` There is no duration type in substrait (https://substrait.io/types/type_classes/), so the cast from arrow to substrate fails: https://github.com/apache/datafusion/blob/4cf1148b3679127871537a555ecc922735a78558/datafusion/substrait/src/logical_plan/producer/types.rs#L30-L289. Interestingly, Duration doesn't really seem to be a fully supported data type by Datafusion. It's not listed as a supported SQL datatype: https://datafusion.apache.org/user-guide/sql/data_types.html#date-time-types. You can't directly create a table with a duration column type: ```SQL CREATE TABLE t (a duration); This feature is not implemented: Unsupported SQL type Custom(ObjectName([Identifier(Ident { value: "duration", quote_style: None, span: Span(Location(1,19)..Location(1,27)) })]), []) ``` but you can indirectly create a table with a duration column type ```SQL > CREATE TABLE t AS SELECT '2023-03-15T15:00:20.000000123'::timestamp - '2023-01-20T23:00:00.000000099'::timestamp; 0 row(s) fetched. Elapsed 0.009 seconds. > SHOW COLUMNS FROM t; +---------------+--------------+------------+-------------------------------------------------------------------------------+----------------------+-------------+ | table_catalog | table_schema | table_name | column_name | data_type | is_nullable | +---------------+--------------+------------+-------------------------------------------------------------------------------+----------------------+-------------+ | datafusion | public | t | Utf8("2023-03-15T15:00:20.000000123") - Utf8("2023-01-20T23:00:00.000000099") | Duration(Nanosecond) | NO | +---------------+--------------+------------+-------------------------------------------------------------------------------+----------------------+-------------+ 1 row(s) fetched. Elapsed 0.010 seconds. ``` and as shown above, some expressions return a duration. One solution would be for Datafusion to stop using Duration and stick to Interval. There has been recent work to add more support for Durations, so I don't think this is desirable: - https://github.com/apache/datafusion/issues/15458 - https://github.com/apache/datafusion/pull/15310 - https://github.com/apache/datafusion/pull/15322 The other solution would be to add suport for casting Arrow Duration types to Substrait. There's probably multiple approaches to this: - Use `INTERVAL_COMPOUND<P>`. This would be a lossy conversion, because the round trip would look like Arrow Duration -> Substrait Interval -> Arrow Interval. - Use some type of user defined type, `STRUCT`, or `NSTRUCT`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org