yhuang-db commented on code in PR #51505:
URL: https://github.com/apache/spark/pull/51505#discussion_r2430820005


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxTopKAggregates.scala:
##########
@@ -317,8 +322,51 @@ object ApproxTopK {
   def getSketchStateDataType(itemDataType: DataType): StructType =
     StructType(
       StructField("sketch", BinaryType, nullable = false) ::
+        StructField("maxItemsTracked", IntegerType, nullable = false) ::
         StructField("itemDataType", itemDataType) ::
-        StructField("maxItemsTracked", IntegerType, nullable = false) :: Nil)
+        StructField("itemDataTypeDDL", StringType, nullable = false) :: Nil)
+
+  def dataTypeToDDL(dataType: DataType): String = dataType match {
+    case _: StringType =>
+      // Hide collation information in DDL format

Review Comment:
   Yes. IIUC, this test runs for all expressions and asserts that each 
expression produces the same output, or throws the same exception, for string 
inputs with the UTF8_BINARY and UTF8_LCASE collations.
   
   If I force toDDL/fromDDL to include the collation, approx_top_k_accumulate 
produces different outputs for the two collations and fails the assertion:
   
   > ArraySeq("{[04 01 0a 03 03 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 0c 00 00 00 64 75 6d 
6d 79 20 73 74 72 69 6e 67], 5, null, item string collate utf8_binary not 
null}") did not equal ArraySeq("{[04 01 0a 03 03 00 00 00 01 00 00 00 00 00 00 
00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 0c 
00 00 00 64 75 6d 6d 79 20 73 74 72 69 6e 67], 5, null, item string collate 
utf8_lcase not null}")
   
   If I simply use `StructField("item", dataType, nullable = false).toDDL` for 
strings, approx_top_k_accumulate still produces different outputs and fails the 
assertion, since only the utf8_lcase DDL carries a collate clause:
   
   > ArraySeq("{[04 01 0a 03 03 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 0c 00 00 00 64 75 6d 
6d 79 20 73 74 72 69 6e 67], 5, null, item string not null}") did not equal 
ArraySeq("{[04 01 0a 03 03 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 0c 00 00 00 64 75 6d 6d 
79 20 73 74 72 69 6e 67], 5, null, item string collate utf8_lcase not null}")
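
To illustrate why hiding the collation makes the two outputs match, here is a minimal, Spark-free sketch. It works on the DDL strings from the failures above rather than on `DataType` values, so it is not the `dataTypeToDDL` implementation in this PR; `CollationAgnosticDdl` and `stripCollation` are hypothetical names used only for this example.

```scala
object CollationAgnosticDdl {
  // Hypothetical helper (not a Spark API): matches a " collate <name>" clause,
  // case-insensitively, so it can be dropped from a DDL fragment.
  private val CollateClause = """(?i)\s+collate\s+\w+""".r

  // Returns the DDL fragment with any collate clause removed.
  def stripCollation(ddl: String): String =
    CollateClause.replaceAllIn(ddl, "")

  def main(args: Array[String]): Unit = {
    // The three DDL suffixes that appear in the assertion failures above.
    val explicitBinary = "item string collate utf8_binary not null"
    val explicitLcase  = "item string collate utf8_lcase not null"
    val implicitBinary = "item string not null"

    // Once the collate clause is hidden, all three forms are identical,
    // which is the property the collation test checks for.
    assert(stripCollation(explicitBinary) == stripCollation(explicitLcase))
    assert(stripCollation(explicitLcase) == implicitBinary)
    assert(stripCollation(implicitBinary) == implicitBinary)
  }
}
```

Matching on the type, as the diff above does, reaches the same result without string manipulation; the regex form is used here only to keep the example self-contained.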



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

