davidm-db commented on code in PR #54223:
URL: https://github.com/apache/spark/pull/54223#discussion_r2853018332


##########
sql/api/src/main/scala/org/apache/spark/sql/types/ops/TypeApiOps.scala:
##########
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types.ops
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.types.{DataType, TimeType}
+
+/**
+ * Base trait for client-side (spark-api) type operations.
+ *
+ * PURPOSE: TypeApiOps handles operations that require spark-api internals (e.g., AgnosticEncoder)
+ * that are not available in the catalyst package. This separation prevents circular dependencies
+ * between sql/api and sql/catalyst modules.
+ *
+ * USAGE - examples of use:
+ *   - Row encoding/decoding (EncodeTypeOps)
+ *   - String formatting (FormatTypeOps)
+ *
+ * RELATIONSHIP TO TypeOps:
+ *   - TypeOps (catalyst): Server-side operations - physical types, literals, conversions
+ *   - TypeApiOps (spark-api): Client-side operations - encoding, formatting
+ *
+ * For TimeType, TimeTypeOps extends TimeTypeApiOps to inherit both sets of operations.
+ *
+ * @see
+ *   TimeTypeApiOps for a reference implementation
+ * @since 4.1.0
+ */
+trait TypeApiOps extends Serializable {
+
+  /** The DataType this Ops instance handles */
+  def dataType: DataType
+}
+
+/**
+ * Factory object for creating TypeApiOps instances.
+ */
+object TypeApiOps {
+
+  /**
+   * Creates a TypeApiOps instance for the given DataType.
+   *
+   * @param dt
+   *   The DataType to get operations for
+   * @return
+   *   TypeApiOps instance for the type
+   * @throws SparkException
+   *   if no TypeApiOps implementation exists for the type
+   */
+  def apply(dt: DataType): TypeApiOps = dt match {
+    case tt: TimeType => new TimeTypeApiOps(tt)
+    // Future types will be added here
+    case _ =>
+      throw SparkException.internalError(
+        s"No TypeApiOps implementation for ${dt.typeName}. " +
+          "This type is not yet supported by the Types Framework.")
+  }
+
+  /**
+   * Checks if a DataType is supported by the Types Framework (client-side).
+   *
+   * @param dt
+   *   The DataType to check
+   * @return
+   *   true if the type is supported by the framework
+   */
+  def supports(dt: DataType): Boolean = dt match {
+    case _: TimeType => true

Review Comment:
   Sorry for a long message...
   
   I'm honestly not exactly sure what the best approach is here; I think there are a couple of options with different tradeoffs. We still need the semantics of `supports` and `apply`, because that is the pattern we need in a bunch of places throughout the code, e.g.:
   ```
   case _ if *TypeOps.supports(dt) => *TypeOps(dt).<func>()
   ```
   
   We need this pattern for two reasons:
   - not all types will go through the framework, only new ones - we need the `supports` semantics
   - we want to allow for incremental implementation of Ops interfaces, or, in specific cases that cannot be handled by the current state of the framework, allow the logic to be implemented outside of the framework
   
   Now, we could make this cleaner by registering every type in a single place, something like:
   ```
   object TypeApiOps {
       def get(dt: DataType): Option[TypeApiOps] = dt match {
         case tt: TimeType => Some(TimeTypeApiOps(tt))
         // single place to register
         case _ => None
       }
   
       def apply(dt: DataType): TypeApiOps = get(dt).getOrElse(
         throw SparkException.internalError(...))
   
       def supports(dt: DataType): Boolean = get(dt).isDefined
     }
   ```
   
   I see three options (trading off between ease of maintenance and efficiency):
   - Current design (each `Ops` class delegates to `TypeOps`), for example:
     - `PhyTypeOps.supports(dt)` -> `TypeOps.supports(dt)` -> 1 pattern match
     - `PhyTypeOps(dt)` -> `TypeOps(dt).asInstanceOf[PhyTypeOps]` -> 1 pattern match + 1 allocation + 1 cast
     - Total: 2 matches, 1 allocation, 1 cast
   - Proposed design with `get()`:
     - `PhyTypeOps.supports(dt)` -> `TypeOps.get(dt).collect{...}.isDefined` -> 1 match + Some + collect/Option
     - `PhyTypeOps(dt)` -> `TypeOps.get(dt).collect{...}.getOrElse(...)` -> same thing again
     - Total: 2 matches, ~4 allocations
   - Most efficient solution: keep the current design, but have each `Ops` class implement its own `supports` and `apply`:
     - `PhyTypeOps.supports(dt)` -> direct `case _: TimeType => true` -> 1 pattern match
     - `PhyTypeOps(dt)` -> direct `case tt: TimeType => TimeTypeOps(tt)` -> 1 pattern match + 1 allocation
     - Total: 2 matches, 1 allocation. No delegation, no Options, no casts.
   
   Also, if we want to support incremental implementation, we might want to either:
   - Let each Ops interface implement its own `supports` - with the overhead of enlisting the type in each Ops class.
   - Do something like:
     ```
      object PhyTypeOps {
        def supports(dt: DataType): Boolean =
          TypeOps.supports(dt) && TypeOps(dt).isInstanceOf[PhyTypeOps]
        def apply(dt: DataType): PhyTypeOps =
          TypeOps(dt).asInstanceOf[PhyTypeOps]
      }
     ```
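   For comparison, the same delegation-based hybrid spelled out as a self-contained sketch (again with simplified stand-in types rather than the real Spark classes); note that this `supports` performs two dispatches and allocates an Ops instance only to test it with `isInstanceOf`:

   ```scala
   // Simplified stand-ins, just to make the sketch self-contained.
   sealed trait DataType
   case class TimeType(precision: Int) extends DataType
   case object StringType extends DataType

   trait TypeOps { def dataType: DataType }
   trait PhyTypeOps extends TypeOps
   final class TimeTypeOps(val dataType: TimeType) extends PhyTypeOps

   // Central registry: the single place where types are enlisted.
   object TypeOps {
     def supports(dt: DataType): Boolean = dt match {
       case _: TimeType => true
       case _ => false
     }
     def apply(dt: DataType): TypeOps = dt match {
       case tt: TimeType => new TimeTypeOps(tt)
       case _ => throw new IllegalStateException("unsupported type")
     }
   }

   // Hybrid: each Ops companion delegates to the central dispatch.
   // `supports` allocates a TimeTypeOps only to discard it after the
   // isInstanceOf check.
   object PhyTypeOps {
     def supports(dt: DataType): Boolean =
       TypeOps.supports(dt) && TypeOps(dt).isInstanceOf[PhyTypeOps]
     def apply(dt: DataType): PhyTypeOps =
       TypeOps(dt).asInstanceOf[PhyTypeOps]
   }
   ```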
   
   I'm honestly not sure how important this overhead might be. I think your input here might be really useful.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

