[PR] [SPARK-51501][SQL] Disable ObjectHashAggregate for group by on collated columns [spark]

via GitHub Thu, 13 Mar 2025 09:48:06 -0700


stefankandic opened a new pull request, #50267:
URL: https://github.com/apache/spark/pull/50267


   ### What changes were proposed in this pull request?
   Disabling `ObjectHashAggregate` when grouping on columns with collations.
   
   
   ### Why are the changes needed?
   https://github.com/apache/spark/pull/45290 added support for sort-based 
aggregation on collated columns and explicitly forbade the use of hash 
aggregate for them. However, it did not consider the third type of aggregate, 
the object hash aggregate, which is used only when TypedImperativeAggregate 
expressions are also present 
([source](https://github.com/apache/spark/blob/f3b081066393e1568c364b6d3bc0bceabd1e7e9f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L1204)).
   
   That means that if we group by a collated column and also have a 
TypedImperativeAggregate, we end up using the object hash aggregate, which 
can lead to incorrect results, as in the example below:
   
   ```sql
   CREATE TABLE tbl(c1 STRING COLLATE UTF8_LCASE, c2 INT) USING PARQUET;
   INSERT INTO tbl VALUES ('HELLO', 1), ('hello', 2), ('HeLlO', 3);
   SELECT COLLECT_LIST(c2) as list FROM tbl GROUP BY c1;
   ```
   where the result would have three rows with values [1], [2] and [3] instead 
of one row with value [1, 2, 3].
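
The difference can be reproduced outside Spark with a minimal Python sketch. It is not Spark's implementation: `group_collect` is a hypothetical helper, and `str.lower` is only an approximate stand-in for how `UTF8_LCASE` compares strings. Grouping on the raw string values splits the rows, while grouping on a collation-aware key merges them.

```python
from collections import defaultdict

# Same rows as the SQL example: (c1, c2)
rows = [("HELLO", 1), ("hello", 2), ("HeLlO", 3)]

def group_collect(rows, key_fn):
    """Hypothetical helper: hash-group rows by key_fn(c1) and collect c2 values."""
    groups = defaultdict(list)
    for c1, c2 in rows:
        groups[key_fn(c1)].append(c2)
    return list(groups.values())

# Keying on the raw string treats differently cased values as distinct groups.
by_raw = group_collect(rows, lambda s: s)

# Keying on a lowercased string (assumed stand-in for UTF8_LCASE) merges them.
by_lcase = group_collect(rows, str.lower)

print(by_raw)    # [[1], [2], [3]]
print(by_lcase)  # [[1, 2, 3]]
```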
   
   For this reason, we should do the same thing as we did for the regular hash 
aggregate: make the object hash aggregate not support grouping expressions on 
collated columns.
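
The intended selection rule can be illustrated with a simplified Python sketch. Everything in it is hypothetical: `GroupingColumn`, `is_binary_stable`, and `choose_aggregate` are names made up for illustration, the real planner also takes other conditions into account, and treating `UTF8_BINARY` as the only hash-safe collation is an assumption of the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GroupingColumn:
    """Hypothetical model of a grouping column and its collation."""
    name: str
    collation: str = "UTF8_BINARY"  # assumed default: byte-wise comparison

def is_binary_stable(col):
    # Assumption for this sketch: only UTF8_BINARY values are equal
    # exactly when their bytes are equal, so only they are safe to hash.
    return col.collation == "UTF8_BINARY"

def choose_aggregate(grouping, has_typed_imperative):
    """Pick an aggregate strategy name; a simplification, not Spark's planner."""
    if all(is_binary_stable(c) for c in grouping):
        # Both hash-based aggregates are allowed only for binary-stable keys.
        return "ObjectHashAggregate" if has_typed_imperative else "HashAggregate"
    # Collated grouping keys fall back to sort-based aggregation.
    return "SortAggregate"

collated = [GroupingColumn("c1", "UTF8_LCASE")]
binary = [GroupingColumn("c1")]

print(choose_aggregate(collated, has_typed_imperative=True))
print(choose_aggregate(binary, has_typed_imperative=True))
```

Under these assumptions, the collated grouping key from the example falls back to the sort-based aggregate even when a TypedImperativeAggregate such as `COLLECT_LIST` is present.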
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   New unit tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

