andygrove opened a new issue, #4654:
URL: https://github.com/apache/datafusion-comet/issues/4654

   ## Background
   
   Comet has a `CodegenDispatchFallback` mechanism for expression serdes that 
have a native implementation which is `Incompatible` with Spark for some (or 
all) inputs. When such an expression reports `Incompatible` and the user has 
not enabled that expression's `allowIncompatible`, it routes through the JVM 
codegen dispatcher (Spark's own `doGenCode` compiled into an Arrow-direct batch 
kernel) instead of falling the operator back to Spark. The dispatcher is on by 
default (`spark.comet.exec.scalaUDF.codegen.enabled=true`).
   
   A consequence of this design: for these expressions the incompatible native 
code is **only reachable when `allowIncompatible=true`**. By default everyone 
already runs through the (correct) codegen dispatcher. So for the "always 
incompatible" expressions, the native impl exists purely to give opt-in users a 
faster-but-wrong result instead of a correct dispatcher result that is already 
near-native speed.
   
   This issue is to discuss whether we should remove some of these incompatible 
native implementations entirely and rely on codegen dispatch.
   
   ## Survey
   
   There are currently 14 serdes mixing in `CodegenDispatchFallback`. They 
split into two groups.
   
   ### Group A: always `Incompatible` (native path is opt-in / wrong-by-default)
   
   `getSupportLevel` returns `Incompatible` for every non-error input. Default 
users always route through codegen dispatch; the native impl only runs under 
`allowIncompatible`.
   
   | Expression | File | Reason for incompatibility |
   |---|---|---|
   | `CometFromUTCTimestamp` | `serde/datetime.scala` | DataFusion 
timezone-string parsing diverges from Spark (e.g. `GMT+1`, `PST` throw a native 
parse error) — #2013 |
   | `CometToUTCTimestamp` | `serde/datetime.scala` | same timezone-parse 
divergence — #2013 |
   | `CometConvertTimezone` | `serde/datetime.scala` | same timezone-parse 
divergence — #2013 |
   | `CometFromUnixTime` | `serde/unixtime.scala` | only supports the default 
pattern; DataFusion's valid timestamp range differs from Spark — 
apache/datafusion#16594 |
   | `CometArrayExcept` | `serde/arrays.scala` | null handling and ordering may 
differ from Spark |
   | `CometArrayJoin` | `serde/arrays.scala` | null handling may differ from 
Spark |
   | `CometArrayIntersect` | `serde/arrays.scala` | result element order may 
differ when the right array is longer (DataFusion probes the longer side) |
   
   ### Group B: conditionally `Incompatible` (native path is `Compatible` for 
the common case)
   
   The native impl is correct and fast for the majority case; only a narrow 
edge case reports `Incompatible` and routes to dispatch. The native code is not 
separable from a "compatible impl" — it is the same code serving both.
   
   | Expression | File | Compatible case / incompatible edge |
   |---|---|---|
   | `CometHour` | `serde/datetime.scala` | compatible for all types except 
`TimestampNTZ` — #3180 |
   | `CometMinute` | `serde/datetime.scala` | compatible for all types except 
`TimestampNTZ` — #3180 |
   | `CometSecond` | `serde/datetime.scala` | compatible for all types except 
`TimestampNTZ` — #3180 |
   | `CometTruncDate` | `serde/datetime.scala` | compatible for literal 
supported formats; incompatible only for non-literal format strings |
   | `CometTruncTimestamp` | `serde/datetime.scala` | compatible for UTC + 
literal format; incompatible for non-UTC timezones (#2649) or non-literal 
format strings |
   | `CometReverse` | `serde/collectionOperations.scala` | compatible except 
for non-default string collation — #2190 |
   | `CometMapFromEntries` | `serde/maps.scala` | compatible except for 
`BinaryType` keys/values |
   
   ## Discussion
   
   For **Group B**, there is nothing meaningful to remove: the native code is 
the compatible fast path for the dominant case, and the `Incompatible` branch 
only routes a rare edge to the dispatcher, which is the desired behavior.
   
   For **Group A**, removing the incompatible native impls (converting them to 
plain codegen dispatch) would:
   
   - change nothing for default users (already dispatching),
   - remove a correctness footgun, since `allowIncompatible` would no longer 
silently produce wrong timezone/array results,
   - allow deleting the dedicated native code where it is not shared with other 
expressions.
   
   Costs / things to weigh:
   
   - With the dispatcher disabled 
(`spark.comet.exec.scalaUDF.codegen.enabled=false`), these would fall all the 
way back to Spark instead of running the opt-in native path.
   - Some Group A natives are shared DataFusion/Comet scalar functions 
(`array_except`, `array_intersect`, `array_to_string`, `from_utc_timestamp`, 
`to_utc_timestamp`) used elsewhere, so removal saves serde code but not 
necessarily much Rust.
   
   ### Questions for the community
   
   1. Should we remove the Group A incompatible native impls and rely on 
codegen dispatch?
   2. Is there a real-world use case for keeping the opt-in incompatible native 
path (faster-but-wrong via `allowIncompatible`) for any of these?
   3. Should we keep Group B exactly as-is?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to