helioshe4 commented on code in PR #54297:
URL: https://github.com/apache/spark/pull/54297#discussion_r2820372438


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala:
##########
@@ -564,6 +564,81 @@ case class ListAgg(
     false
   }
 
+  /**
+   * Determines whether the order mismatch between [[child]] and 
[[orderExpressions]] is due to
+   * a cast, and if so, whether that cast is safe for DISTINCT deduplication.
+   *
+   * When LISTAGG(DISTINCT) is used with a non-string column, a Cast is 
applied to the
+   * child expression. The DISTINCT rewrite uses GROUP BY on the cast result, 
which can produce
+   * incorrect deduplication for types where equal values cast to different 
strings
+   * (e.g., Float/Double where -0.0 and 0.0 are GROUP BY-equal but cast to 
different strings).
+   *
+   * Safety is determined by both the source type (via 
[[isCastSafeForDistinct]]) and the target
+   * type's collation (via [[isCastTargetSafeForDistinct]]).
+   *
+   * @return `Some(Right(()))` if the mismatch is due to a safe cast,
+   *         `Some(Left((inputType, castType)))` if the cast is unsafe, 
carrying the source and
+   *         target types for use in the error message,
+   *         `None` if the mismatch is not due to a cast at all
+   */
+  def orderMismatchCastSafety: Option[Either[(DataType, DataType), Unit]] = {
+    if (orderExpressions.size != 1) return None
+    child match {
+      case Cast(castChild, castType, _, _)
+        if orderExpressions.head.child.semanticEquals(castChild) =>
+          if (isCastSafeForDistinct(castChild.dataType) && 
isCastTargetSafeForDistinct(castType)) {
+            Some(Right(()))
+          } else {
+            Some(Left((castChild.dataType, castType)))
+          }
+      case _ => None
+    }
+  }
+
+  /**
+   * Checks whether a source type preserves equality semantics after casting 
to STRING/BINARY.
+   *
+   * A type is safe if equal values always produce equal string 
representations and different
+   * string representations always imply different values. Types like 
Float/Double are unsafe
+   * because IEEE 754 negative zero (-0.0) and positive zero (0.0) are equal 
but produce
+   * different string representations.
+   *
+   * @param dt the source [[DataType]] before casting
+   * @return true if the cast preserves equality semantics for DISTINCT 
deduplication
+   * @see [[orderMismatchCastSafety]]
+   */
+  private def isCastSafeForDistinct(dt: DataType): Boolean = dt match {
+    case _: IntegerType | LongType | ShortType | ByteType => true
+    case _: DecimalType => true
+    case _: DateType | TimestampType | TimestampNTZType => true
+    case _: TimeType => true

Review Comment:
   good point.  the `Timestamp` object is internally represented as 
microseconds in epoch UTC, and also holds the Timezone information (set at 
session level).    So when converting to string, the string displays the time 
according to local (session) timezone, but the timezone is not in the actual 
string.
   
   So i believe this causes issues for daylight savings fallback (which I 
validated with a test).  e.g. if 2 timestamps are recorded, one 30 min before 
and one 30 min after DST fallback occurs, their string representation would be 
the same (since the later one gets reduced 1hr), but their GROUP BY key value 
would be different.  
   
   if this is the case, we should remove `TimestampType` entirely from the 
whitelist, but `TimestampNTZType` should be safe because its timezone-agnostic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to