Thank you for sharing the direction, Max. Since this is internal refactoring, can we do this migration safely, step by step, across multiple Apache Spark versions without blocking any releases?
The proposed direction itself looks reasonable and feasible to me. Thanks, Dongjoon.

On 2025/09/10 13:44:45 "serge rielau.com" wrote:
> I think this is a great idea. There is a significant backlog of types which
> should be added, e.g. TIMESTAMP(9), TIMESTAMP WITH TIME ZONE, TIME WITH
> TIME ZONE, and some sort of big decimal, to name a few.
> Making these more "plug and play" is goodness.
>
> +1
>
> On Sep 10, 2025, at 1:22 PM, Max Gekk <[email protected]> wrote:
>
> Hi All,
>
> I would like to propose a refactoring of internal operations over Catalyst's
> data types. In the current implementation, data types are handled in an
> ad hoc manner, and the processing logic is dispersed across the entire code
> base. There are more than 100 places where every data type is pattern
> matched. For example, formatting of type values (converting them to strings)
> is implemented in the same way in ToStringBase and in toString
> (literals.scala). This leads to a few issues:
>
> 1. If you change the handling in one place, you might miss other places. The
> compiler won't help you in such cases.
> 2. Adding a new data type has constant and significant overhead. Based on
> our experience of adding new data types: ANSI intervals
> (https://issues.apache.org/jira/browse/SPARK-27790) took
> 1.5 years, TIMESTAMP_NTZ
> (https://issues.apache.org/jira/browse/SPARK-35662) took
> 1 year, and TIME (https://issues.apache.org/jira/browse/SPARK-51162) has
> not been finished yet, although we have spent more than half a year on it
> so far.
>
> I propose to define a set of interfaces, and operation classes for every
> data type. The operation classes (Ops) should implement the subsets of
> interfaces that are suitable for a particular data type.
> For example, TimeType will have the companion class TimeTypeOps which
> implements the following operations:
> - Operations over the underlying physical type
> - Literal-related operations
> - Formatting of type values to strings
> - Converting to/from the external Java type: java.time.LocalTime in the
>   case of TimeType
> - Hashing of data type values
>
> On the handling side, we won't need to examine every data type. We can
> check that a data type and its Ops instance support a required interface,
> and invoke the needed method. For example:
> ---
> override def sql: String = dataTypeOps match {
>   case fops: FormatTypeOps => fops.toSQLValue(value)
>   case _ => value.toString
> }
> ---
> Here is the prototype of the proposal:
> https://github.com/apache/spark/pull/51467
>
> Your comments and feedback would be greatly appreciated.
>
> Yours faithfully,
> Max Gekk
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
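For readers skimming the thread, the capability-dispatch idea above can be sketched in a few lines of standalone Scala. The names FormatTypeOps and toSQLValue come from the proposal; TimeTypeOps' formatting details, OpaqueTypeOps, and the sqlOf helper are illustrative assumptions here, not Spark's actual code:

```scala
// Minimal sketch of the proposed Ops pattern: dispatch on what an Ops
// instance *can do* (the interfaces it implements) rather than pattern
// matching every concrete data type in 100+ places.

// Base trait: every data type gets a companion Ops object.
trait DataTypeOps

// Capability interface: only types that can format values implement it.
trait FormatTypeOps extends DataTypeOps {
  def toSQLValue(value: Any): String
}

// Hypothetical Ops for a TIME-like type backed by java.time.LocalTime.
object TimeTypeOps extends FormatTypeOps {
  override def toSQLValue(value: Any): String =
    s"TIME '${value.asInstanceOf[java.time.LocalTime]}'"
}

// Hypothetical Ops for a type with no special formatting support.
object OpaqueTypeOps extends DataTypeOps

object OpsSketch {
  // The handling side checks the capability and falls back generically,
  // mirroring the `sql` override shown in the proposal.
  def sqlOf(dataTypeOps: DataTypeOps, value: Any): String = dataTypeOps match {
    case fops: FormatTypeOps => fops.toSQLValue(value)
    case _ => value.toString
  }

  def main(args: Array[String]): Unit = {
    println(sqlOf(TimeTypeOps, java.time.LocalTime.of(12, 30))) // TIME '12:30'
    println(sqlOf(OpaqueTypeOps, 42))                           // 42
  }
}
```

Adding a new data type under this scheme would mean writing one new Ops object that mixes in the interfaces it supports, with no changes to the dispatch sites.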
