andygrove commented on code in PR #4713:
URL: https://github.com/apache/datafusion-comet/pull/4713#discussion_r3508354817
##########
spark/src/main/scala/org/apache/comet/serde/arrays.scala:
##########
@@ -377,13 +386,31 @@ object CometArrayExcept
}
}
-object CometArrayJoin extends CometExpressionSerde[ArrayJoin] with
CodegenDispatchFallback {
+object CometArrayJoin
+ extends CometExpressionSerde[ArrayJoin]
+ with CometTypeShim
+ with CodegenDispatchFallback {
private val incompatReason = "Null handling may differ from Spark"
+ private val unsupportedCollationReason =
+ "array_join on collated strings is not supported " +
+ "(https://github.com/apache/datafusion-comet/issues/2190)"
+
override def getIncompatibleReasons(): Seq[String] = Seq(incompatReason)
- override def getSupportLevel(expr: ArrayJoin): SupportLevel =
Incompatible(Some(incompatReason))
+ override def getUnsupportedReasons(): Seq[String] =
Seq(unsupportedCollationReason)
+
+ override def getSupportLevel(expr: ArrayJoin): SupportLevel = {
+ // Spark 4.0 widens ArrayJoin's input to StringTypeWithCollation; the
native array_to_string
+ // produces UTF8_BINARY semantics and does not propagate non-default
collations, so surface
+ // that as a fallback reason in EXPLAIN, mirroring CometArrayIntersect.
+ if (hasNonDefaultStringCollation(expr.array.dataType)) {
+ Unsupported(Some(unsupportedCollationReason))
Review Comment:
Good call, reconciled all three. I pushed 41fc083.
`Unsupported` here was just following `array_intersect`, not a real
limitation. The dispatcher accepts collated `StringType` and `ArrayType` inputs
(and exempts the `ResolvedCollation` `Unevaluable`), so I flipped both
`array_join` and `array_intersect` collation cases to `Incompatible`. They now
route through the dispatcher and stay native, lining up with `CometReverse`.
The two differ in why the dispatcher is needed, which I noted in the code:
- `array_join`: concatenation is collation-independent, so the joined value
is always correct; only the output string's collation metadata is dropped.
- `array_intersect`: membership is collation-aware, so the byte-based native
path is genuinely value-wrong. The dispatcher (Spark's own `doGenCode`) is what
produces the correct set.
Added `array_join_collation.sql` (flipped from `expect_fallback` to native
`query`) and a new `array_intersect_collation.sql` exercising `UTF8_LCASE`
dedup (e.g. `'A'` matching `'a'`); both assert native execution and match
Spark. A vacuous byte-dedup fallback would return an empty array and fail
against Spark, so the intersect test confirms the dispatched path really ran.
On the minor point: left the check as `expr.array.dataType` since it is
equivalent and reads as "is the input array's element type collated", but happy
to switch to `expr.dataType` to mirror the sibling if you prefer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]