Thank you for sharing the direction, Max. Since this is internal refactoring, can we do this migration safely, step by step, across multiple Apache Spark versions without blocking any releases?
The proposed direction itself looks reasonable and feasible to me. Thanks, Dongjoon.

On 2025/09/10 13:44:45 "serge rielau.com" wrote:
> I think this is a great idea. There is a significant backlog of types which
> should be added, e.g. TIMESTAMP(9), TIMESTAMP WITH TIME ZONE, TIME WITH
> TIME ZONE, and some sort of big decimal, to name a few.
> Making these more "plug and play" is goodness.
>
> +1
>
> On Sep 10, 2025, at 1:22 PM, Max Gekk <[email protected]> wrote:
>
> Hi All,
>
> I would like to propose a refactoring of internal operations over Catalyst's
> data types. In the current implementation, data types are handled in an
> ad hoc manner, and the processing logic is dispersed across the entire code
> base. There are more than 100 places where every data type is pattern
> matched. For example, formatting of type values (converting them to strings)
> is implemented in the same way in ToStringBase and in toString
> (literals.scala). This leads to a few issues:
>
> 1. If you change the handling in one place, you might miss other places. The
> compiler won't help you in such cases.
> 2. Adding a new data type has constant and significant overhead. Based on
> our experience of adding new data types: ANSI intervals
> (https://issues.apache.org/jira/browse/SPARK-27790) took
> 1.5 years, TIMESTAMP_NTZ
> (https://issues.apache.org/jira/browse/SPARK-35662) took
> 1 year, and TIME (https://issues.apache.org/jira/browse/SPARK-51162) has
> not been finished yet, although we have spent more than half a year on it
> so far.
>
> I propose to define a set of interfaces, and operation classes for every
> data type. The operation classes (Ops) should implement the subsets of
> interfaces that are suitable for a particular data type.
> For example, TimeType will have the companion class TimeTypeOps which
> implements the following operations:
> - Operations over the underlying physical type
> - Literal-related operations
> - Formatting of type values to strings
> - Converting to/from the external Java type: java.time.LocalTime in the
>   case of TimeType
> - Hashing of data type values
>
> On the handling side, we won't need to examine every data type. We can
> check that a data type and its Ops instance support a required interface,
> and invoke the needed method. For example:
> ---
> override def sql: String = dataTypeOps match {
>   case fops: FormatTypeOps => fops.toSQLValue(value)
>   case _ => value.toString
> }
> ---
> Here is the prototype of the proposal:
> https://github.com/apache/spark/pull/51467
>
> Your comments and feedback would be greatly appreciated.
>
> Yours faithfully,
> Max Gekk
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
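For readers skimming the thread, the capability-dispatch idea above can be sketched in a few lines of standalone Scala. The names FormatTypeOps and toSQLValue come from the proposal; TimeTypeOps' formatting details, OpaqueTypeOps, and the sqlOf helper are illustrative assumptions here, not Spark's actual code:

```scala
// Minimal sketch of the proposed Ops pattern: dispatch on what an Ops
// instance *can do* (the interfaces it implements) rather than pattern
// matching every concrete data type in 100+ places.

// Base trait: every data type gets a companion Ops object.
trait DataTypeOps

// Capability interface: only types that can format values implement it.
trait FormatTypeOps extends DataTypeOps {
  def toSQLValue(value: Any): String
}

// Hypothetical Ops for a TIME-like type backed by java.time.LocalTime.
object TimeTypeOps extends FormatTypeOps {
  override def toSQLValue(value: Any): String =
    s"TIME '${value.asInstanceOf[java.time.LocalTime]}'"
}

// Hypothetical Ops for a type with no special formatting support.
object OpaqueTypeOps extends DataTypeOps

object OpsSketch {
  // The handling side checks the capability and falls back generically,
  // mirroring the `sql` override shown in the proposal.
  def sqlOf(dataTypeOps: DataTypeOps, value: Any): String = dataTypeOps match {
    case fops: FormatTypeOps => fops.toSQLValue(value)
    case _ => value.toString
  }

  def main(args: Array[String]): Unit = {
    println(sqlOf(TimeTypeOps, java.time.LocalTime.of(12, 30))) // TIME '12:30'
    println(sqlOf(OpaqueTypeOps, 42))                           // 42
  }
}
```

Adding a new data type under this scheme would mean writing one new Ops object that mixes in the interfaces it supports, with no changes to the dispatch sites.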
