findepi opened a new issue, #12644:
URL: https://github.com/apache/datafusion/issues/12644

   ### Is your feature request related to a problem or challenge?
   
   Currently DataFusion provides a lot of built-in types which are useful when 
building applications / query engines on top of DataFusion. However, even 
a plethora of types is not enough. DataFusion lacks types that exist in 
other systems, which limits DataFusion's applicability as an "LLVM for query engines".
   
   For example, these types commonly found in other systems do not exist today
   - char(n)
   - varchar(n)
   - timestamp with time zone (a pair of "point in time" + "time zone" 
information; found in Oracle, Trino, Snowflake, etc.)
     - DataFusion currently uses Arrow `DataType` and the closest type Arrow has 
is "timestamp(zone)", where every value is in the _same_ zone
   - timestamp with local time zone (point in time without zone information; 
found in Spark, Hive, PostgreSQL)
     - DataFusion currently uses Arrow `DataType` and the closest type Arrow has 
is "timestamp(zone)" with e.g. the UTC zone. However, a cast to varchar should 
behave differently for "timestamp(UTC)" and for "timestamp with local time 
zone"
   - time with time zone 
   - JSON
     - DataFusion currently uses Arrow `DataType` and the closest type Arrow has 
is Utf8, potentially with some metadata information. Utf8 might be a perfect 
carrier type for JSON data, but "cast(json AS T)" and "cast(utf8 AS T)" are 
usually pretty different operations
   - VARIANT (https://github.com/apache/datafusion/issues/10987)
   - geospatial Geometry types 
(https://github.com/apache/datafusion/issues/7859)
   - HLL (hyperloglog), digests (t-digest, q-digest, other statistical digests)
   - extensions for applications building on top of DF; including user defined 
types (UDT) (https://github.com/apache/datafusion/issues/7923)
     - ability to provide _user_-defined types is even broader than ability to 
provide extension types ("_rust_-defined types")
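
To make the "timestamp with time zone" gap concrete, here is a minimal sketch in plain Rust (no Arrow dependency). All names (`ZonedColumn`, `TimestampTz`, `to_single_zone_column`) are hypothetical and exist only for illustration; they are not DataFusion or Arrow APIs. The sketch shows that values carrying their own zone fit a "one zone per column" type only when they all happen to agree on the zone.

```rust
/// Hypothetical stand-in for an Arrow-style "timestamp(zone)" column:
/// the zone is part of the column type, so every value shares it.
#[derive(Debug, Clone, PartialEq)]
pub struct ZonedColumn {
    pub zone: String,
    pub micros: Vec<i64>,
}

/// Hypothetical "timestamp with time zone" value:
/// a point in time plus its own zone, stored per value.
#[derive(Debug, Clone, PartialEq)]
pub struct TimestampTz {
    pub micros: i64,
    pub zone: String,
}

/// Try to fit per-value-zone data into a single-zone column.
/// Returns `None` for empty input or when the values disagree on the zone,
/// because a single-zone column cannot keep that information.
pub fn to_single_zone_column(values: &[TimestampTz]) -> Option<ZonedColumn> {
    let first = values.first()?;
    if values.iter().all(|v| v.zone == first.zone) {
        Some(ZonedColumn {
            zone: first.zone.clone(),
            micros: values.iter().map(|v| v.micros).collect(),
        })
    } else {
        None
    }
}
```

A column mixing `Europe/Warsaw` and `America/New_York` values yields `None` here, which is the case a dedicated type would need to handle.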
   
   
   
   
   ### Describe the solution you'd like
   
   1. Introduction of DataFusion's own type system
      - this is generally covered by 
https://github.com/apache/datafusion/issues/11513 & 
https://github.com/apache/datafusion/issues/12622
   2. Introduction of extensions in the DataFusion type system, allowing 
applications building on DataFusion to provide more types 
      - the extension types -- not unlike DataFusion built-in types -- need to 
use Arrow types as a "carrier type" for transporting data
      - the Arrow type metadata woven into schema _fields_ can be used to 
indicate use of extension types to the client, when data is returned to the 
user in Arrow form
      - for example, a "timestamp with time zone" type _could_ be represented 
as Struct with two fields: point_in_time, time_zone
   3. Ability to dynamically find operations on types during function 
resolution or runtime
      - for example a `CAST(array<T> AS varchar)` needs to know how to do 
`cast(T AS varchar)`. It cannot delegate this logic fully to Arrow, because 
Arrow won't have a notion of extension types.
        - eg if "timestamp with time zone" uses a Struct as a carrier type, it 
still needs to define its own `cast(... AS varchar)`. It cannot use the default 
`cast(struct AS varchar)`.
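
The lookup described in the last item could be sketched roughly as follows. This is a plain-Rust illustration under stated assumptions, not a proposed DataFusion API: `Value`, `LogicalType`, `CastRegistry` and both cast functions are hypothetical, `Value` stands in for Arrow carrier data, and the output format of the timestamp cast is arbitrary. The point is that the array cast resolves the element cast through the registry instead of falling back to the carrier type's default.

```rust
use std::collections::HashMap;

/// Hypothetical carrier values, standing in for Arrow data.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int64(i64),
    Utf8(String),
    Struct(Vec<(String, Value)>),
    List(Vec<Value>),
}

/// Hypothetical logical types; `Extension` is identified by name.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum LogicalType {
    Int64,
    Extension(String),
    Array(Box<LogicalType>),
}

/// A cast-to-varchar implementation for one logical type.
pub type CastToVarchar = fn(&Value) -> Option<String>;

/// Registry consulted during function resolution or at runtime.
#[derive(Default)]
pub struct CastRegistry {
    to_varchar: HashMap<LogicalType, CastToVarchar>,
}

impl CastRegistry {
    pub fn register(&mut self, ty: LogicalType, f: CastToVarchar) {
        self.to_varchar.insert(ty, f);
    }

    /// `CAST(array<T> AS varchar)` looks up `cast(T AS varchar)` recursively,
    /// so extension element types use their own cast, not the carrier's.
    pub fn cast_to_varchar(&self, ty: &LogicalType, v: &Value) -> Option<String> {
        match (ty, v) {
            (LogicalType::Array(elem), Value::List(items)) => {
                let parts: Option<Vec<String>> = items
                    .iter()
                    .map(|item| self.cast_to_varchar(elem, item))
                    .collect();
                Some(format!("[{}]", parts?.join(", ")))
            }
            _ => self.to_varchar.get(ty).and_then(|f| f(v)),
        }
    }
}

/// Built-in cast for Int64.
pub fn int64_to_varchar(v: &Value) -> Option<String> {
    match v {
        Value::Int64(i) => Some(i.to_string()),
        _ => None,
    }
}

/// Extension cast for a Struct-carried "timestamp with time zone".
/// The "<point> <zone>" rendering is an assumption made for this sketch.
pub fn timestamp_tz_to_varchar(v: &Value) -> Option<String> {
    match v {
        Value::Struct(fields) => match fields.as_slice() {
            [(_, Value::Int64(point)), (_, Value::Utf8(zone))] => {
                Some(format!("{} {}", point, zone))
            }
            _ => None,
        },
        _ => None,
    }
}
```

Keying the registry by logical type rather than by carrier type is what lets two extension types share the same Struct layout while still casting differently.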
   
   
   ### Describe alternatives you've considered
   
   #### Everything is built-in
   
   DataFusion could provide all types needed by applications building on top of 
DataFusion as built-in DataFusion types.
   This would be easiest to implement, but could lead to scope-creep for the 
project. This could also lead to conflicts where types look the same but the 
desired behavior differs between applications building on top of DataFusion. 
For example Oracle's and Trino's "timestamp with time zone" can represent 
political zones while Snowflake's allows only fixed offsets.
   
   #### No-op
   
   Do not provide extension types. This would limit DataFusion's applicability.
   DataFusion cannot be considered "LLVM for query engines" if it cannot serve 
as an engine, or potential engine, for existing popular query engines. 
   
   ### Additional context
   
   The need to create extension types was raised in the "[Proposal] Decouple 
logical from physical types" discussion:
   
   - https://github.com/apache/datafusion/issues/11513 & 
https://github.com/apache/datafusion/issues/12622
   
   However, the introduction of DataFusion's own types does not require the 
introduction of extension types.
   Extension types are complex enough (especially given their impact on 
functions) that they deserve their own roadmap issue.
   
   
   The impact of extension types on functions, function runtime, and function 
resolution is very clear, so this relates to the Simple Functions initiative:
   
   - https://github.com/apache/datafusion/issues/12635
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

