jkosh44 commented on issue #16285:
URL: https://github.com/apache/datafusion/issues/16285#issuecomment-2950146627

   The query that fails looks like
   ```sql
   create table foo (val int, ts1 timestamp, ts2 timestamp, i interval) ...
   SELECT val, ts1 - ts2 FROM foo ORDER BY ts2 - ts1;
   ```
   
   Subtracting two timestamps returns a [duration 
type](https://arrow.apache.org/docs/format/Columnar.html#data-types).
   
   ```sql
   > SELECT arrow_typeof(ts1 - ts2) FROM foo LIMIT 1;
   +---------------------------------+
   | arrow_typeof(foo.ts1 - foo.ts2) |
   +---------------------------------+
   | Duration(Nanosecond)            |
   +---------------------------------+
   1 row(s) fetched. 
   Elapsed 0.005 seconds.
   ```
   
   There is no duration type in substrait 
(https://substrait.io/types/type_classes/), so the cast from arrow to substrate 
fails: 
https://github.com/apache/datafusion/blob/4cf1148b3679127871537a555ecc922735a78558/datafusion/substrait/src/logical_plan/producer/types.rs#L30-L289.
   
   Interestingly, Duration doesn't really seem to be a fully supported data 
type by Datafusion. It's not listed as a supported SQL datatype: 
https://datafusion.apache.org/user-guide/sql/data_types.html#date-time-types. 
You can't directly create a table with a duration column type:
   ```SQL
   CREATE TABLE t (a duration);
   This feature is not implemented: Unsupported SQL type 
Custom(ObjectName([Identifier(Ident { value: "duration", quote_style: None, 
span: Span(Location(1,19)..Location(1,27)) })]), [])
   ```
   but you can indirectly create a table with a duration column type
   ```SQL
   > CREATE TABLE t AS SELECT '2023-03-15T15:00:20.000000123'::timestamp - 
'2023-01-20T23:00:00.000000099'::timestamp;
   0 row(s) fetched. 
   Elapsed 0.009 seconds.
   
   > SHOW COLUMNS FROM t;
   
+---------------+--------------+------------+-------------------------------------------------------------------------------+----------------------+-------------+
   | table_catalog | table_schema | table_name | column_name                    
                                               | data_type            | 
is_nullable |
   
+---------------+--------------+------------+-------------------------------------------------------------------------------+----------------------+-------------+
   | datafusion    | public       | t          | 
Utf8("2023-03-15T15:00:20.000000123") - Utf8("2023-01-20T23:00:00.000000099") | 
Duration(Nanosecond) | NO          |
   
+---------------+--------------+------------+-------------------------------------------------------------------------------+----------------------+-------------+
   1 row(s) fetched. 
   Elapsed 0.010 seconds.
   
   ```
   and as shown above, some expressions return a duration.
   
   One solution would be for Datafusion to stop using Duration and stick to 
Interval. There has been recent work to add more support for Durations, so I 
don't think this is desirable:
   
     - https://github.com/apache/datafusion/issues/15458
     - https://github.com/apache/datafusion/pull/15310
     - https://github.com/apache/datafusion/pull/15322
   
   The other solution would be to add suport for casting Arrow Duration types 
to Substrait. There's probably multiple approaches to this:
   
     - Use `INTERVAL_COMPOUND<P>`. This would be a lossy conversion, because 
the round trip would look like Arrow Duration -> Substrait Interval -> Arrow 
Interval.
     - Use some type of user defined type, `STRUCT`, or `NSTRUCT`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to