Re: [PR] fix: array expression audit follow-ups (#4503) [datafusion-comet]

via GitHub Wed, 01 Jul 2026 11:51:14 -0700


andygrove commented on code in PR #4713:
URL: https://github.com/apache/datafusion-comet/pull/4713#discussion_r3508354817



##########
spark/src/main/scala/org/apache/comet/serde/arrays.scala:
##########
@@ -377,13 +386,31 @@ object CometArrayExcept
   }
 }
 
-object CometArrayJoin extends CometExpressionSerde[ArrayJoin] with 
CodegenDispatchFallback {
+object CometArrayJoin
+    extends CometExpressionSerde[ArrayJoin]
+    with CometTypeShim
+    with CodegenDispatchFallback {
 
   private val incompatReason = "Null handling may differ from Spark"
 
+  private val unsupportedCollationReason =
+    "array_join on collated strings is not supported " +
+      "(https://github.com/apache/datafusion-comet/issues/2190)"
+
   override def getIncompatibleReasons(): Seq[String] = Seq(incompatReason)
 
-  override def getSupportLevel(expr: ArrayJoin): SupportLevel = 
Incompatible(Some(incompatReason))
+  override def getUnsupportedReasons(): Seq[String] = 
Seq(unsupportedCollationReason)
+
+  override def getSupportLevel(expr: ArrayJoin): SupportLevel = {
+    // Spark 4.0 widens ArrayJoin's input to StringTypeWithCollation; the 
native array_to_string
+    // produces UTF8_BINARY semantics and does not propagate non-default 
collations, so surface
+    // that as a fallback reason in EXPLAIN, mirroring CometArrayIntersect.
+    if (hasNonDefaultStringCollation(expr.array.dataType)) {
+      Unsupported(Some(unsupportedCollationReason))

Review Comment:
   Good call, reconciled all three. I pushed 41fc083.
   
   `Unsupported` here was just following `array_intersect`, not a real 
limitation. The dispatcher accepts collated `StringType` and `ArrayType` inputs 
(and exempts the `ResolvedCollation` `Unevaluable`), so I flipped both 
`array_join` and `array_intersect` collation cases to `Incompatible`. They now 
route through the dispatcher and stay native, lining up with `CometReverse`.
   
   The two differ in why the dispatcher is needed, which I noted in the code:
   
   - `array_join`: concatenation is collation-independent, so the joined value 
is always correct; only the output string's collation metadata is dropped.
   - `array_intersect`: membership is collation-aware, so the byte-based native 
path is genuinely value-wrong. The dispatcher (Spark's own `doGenCode`) is what 
produces the correct set.
   
   Added `array_join_collation.sql` (flipped from `expect_fallback` to native 
`query`) and a new `array_intersect_collation.sql` exercising `UTF8_LCASE` 
dedup (e.g. `'A'` matching `'a'`); both assert native execution and match 
Spark. A vacuous byte-dedup fallback would return an empty array and fail 
against Spark, so the intersect test confirms the dispatched path really ran.
   
   On the minor point: left the check as `expr.array.dataType` since it is 
equivalent and reads as "is the input array's element type collated", but happy 
to switch to `expr.dataType` to mirror the sibling if you prefer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: array expression audit follow-ups (#4503) [datafusion-comet]

Reply via email to