andygrove opened a new issue, #4654: URL: https://github.com/apache/datafusion-comet/issues/4654
## Background Comet has a `CodegenDispatchFallback` mechanism for expression serdes that have a native implementation which is `Incompatible` with Spark for some (or all) inputs. When such an expression reports `Incompatible` and the user has not enabled that expression's `allowIncompatible`, it routes through the JVM codegen dispatcher (Spark's own `doGenCode` compiled into an Arrow-direct batch kernel) instead of falling the operator back to Spark. The dispatcher is on by default (`spark.comet.exec.scalaUDF.codegen.enabled=true`). A consequence of this design: for these expressions the incompatible native code is **only reachable when `allowIncompatible=true`**. By default everyone already runs through the (correct) codegen dispatcher. So for the "always incompatible" expressions, the native impl exists purely to give opt-in users a faster-but-wrong result instead of a correct dispatcher result that is already near-native speed. This issue is to discuss whether we should remove some of these incompatible native implementations entirely and rely on codegen dispatch. ## Survey There are currently 14 serdes mixing in `CodegenDispatchFallback`. They split into two groups. ### Group A: always `Incompatible` (native path is opt-in / wrong-by-default) `getSupportLevel` returns `Incompatible` for every non-error input. Default users always route through codegen dispatch; the native impl only runs under `allowIncompatible`. | Expression | File | Reason for incompatibility | |---|---|---| | `CometFromUTCTimestamp` | `serde/datetime.scala` | DataFusion timezone-string parsing diverges from Spark (e.g. `GMT+1`, `PST` throw a native parse error) — #2013 | | `CometToUTCTimestamp` | `serde/datetime.scala` | same timezone-parse divergence — #2013 | | `CometConvertTimezone` | `serde/datetime.scala` | same timezone-parse divergence — #2013 | | `CometFromUnixTime` | `serde/unixtime.scala` | only supports the default pattern; DataFusion's valid timestamp range differs from Spark — apache/datafusion#16594 | | `CometArrayExcept` | `serde/arrays.scala` | null handling and ordering may differ from Spark | | `CometArrayJoin` | `serde/arrays.scala` | null handling may differ from Spark | | `CometArrayIntersect` | `serde/arrays.scala` | result element order may differ when the right array is longer (DataFusion probes the longer side) | ### Group B: conditionally `Incompatible` (native path is `Compatible` for the common case) The native impl is correct and fast for the majority case; only a narrow edge case reports `Incompatible` and routes to dispatch. The native code is not separable from a "compatible impl" — it is the same code serving both. | Expression | File | Compatible case / incompatible edge | |---|---|---| | `CometHour` | `serde/datetime.scala` | compatible for all types except `TimestampNTZ` — #3180 | | `CometMinute` | `serde/datetime.scala` | compatible for all types except `TimestampNTZ` — #3180 | | `CometSecond` | `serde/datetime.scala` | compatible for all types except `TimestampNTZ` — #3180 | | `CometTruncDate` | `serde/datetime.scala` | compatible for literal supported formats; incompatible only for non-literal format strings | | `CometTruncTimestamp` | `serde/datetime.scala` | compatible for UTC + literal format; incompatible for non-UTC timezones (#2649) or non-literal format strings | | `CometReverse` | `serde/collectionOperations.scala` | compatible except for non-default string collation — #2190 | | `CometMapFromEntries` | `serde/maps.scala` | compatible except for `BinaryType` keys/values | ## Discussion For **Group B**, there is nothing meaningful to remove: the native code is the compatible fast path for the dominant case, and the `Incompatible` branch only routes a rare edge to the dispatcher, which is the desired behavior. For **Group A**, removing the incompatible native impls (converting them to plain codegen dispatch) would: - change nothing for default users (already dispatching), - remove a correctness footgun, since `allowIncompatible` would no longer silently produce wrong timezone/array results, - allow deleting the dedicated native code where it is not shared with other expressions. Costs / things to weigh: - With the dispatcher disabled (`spark.comet.exec.scalaUDF.codegen.enabled=false`), these would fall all the way back to Spark instead of running the opt-in native path. - Some Group A natives are shared DataFusion/Comet scalar functions (`array_except`, `array_intersect`, `array_to_string`, `from_utc_timestamp`, `to_utc_timestamp`) used elsewhere, so removal saves serde code but not necessarily much Rust. ### Questions for the community 1. Should we remove the Group A incompatible native impls and rely on codegen dispatch? 2. Is there a real-world use case for keeping the opt-in incompatible native path (faster-but-wrong via `allowIncompatible`) for any of these? 3. Should we keep Group B exactly as-is? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
