[ https://issues.apache.org/jira/browse/SPARK-57346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087784#comment-18087784 ]
Anupam Yadav commented on SPARK-57346:
--------------------------------------
Investigated this with the single-pass resolver (Analyzer++). I found two
distinct issues:
1. *Crash*: With `spark.sql.analyzer.singlePassResolver.enabled=true`, these
queries crash with `SparkUnsupportedOperationException: Cannot call the method
dataType of BaseGroupingSets` before producing any results. I filed SPARK-57353
and a fix (https://github.com/apache/spark/pull/56417) that wires
GroupingAnalyticsResolver into AggregateResolver so the crash no longer occurs.
2. *Wrong results (this issue)*: After the crash fix, multi-column ROLLUP with
HAVING still produces wrong results under the single-pass resolver. For example:
{code:sql}
SELECT a, SUM(b) FROM VALUES (1,10),(1,20),(2,30) AS t(a,b)
GROUP BY ROLLUP(a, b) HAVING SUM(b) > 25;
{code}
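Rows below are written as [a, b, SUM(b)], with the grouping value of b included for readability:
- Legacy analyzer: 4 rows -- [1,null,30], [2,30,30], [2,null,30], [null,null,60]
- Single-pass resolver (after crash fix): 1 row -- [2,30,30]
This is consistent with the issue description: SUM(b) in HAVING appears to
aggregate the Expand-output copy of b, which is NULL for rolled-up groups, so
every rolled-up row is filtered out.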
This confirms SPARK-57346 is a real, separate bug that remains after the crash
is fixed. Could you confirm this matches what you observed? Was your repro on
the single-pass resolver or the default analyzer?
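
To make the discrepancy concrete, here is a minimal sketch in plain Python that models the two behaviours for the ROLLUP query above. It is not Spark code: `expand_rollup`, `sql_sum`, and `rollup_having` are hypothetical helpers written for illustration, and the model omits the grouping ID that Spark's Expand also emits. It assumes the bug is exactly what the issue description states, namely that SUM(b) reads the Expand-output copy of b instead of the original column.

```python
# Rows of t(a, b) from the repro query.
ROWS = [(1, 10), (1, 20), (2, 30)]


def expand_rollup(rows):
    """Simplified model of Expand for ROLLUP(a, b).

    Emits one copy of each input row per grouping set as
    (a_key, b_key, b_original). The keys are nulled out for
    rolled-up levels; b_original keeps the untouched input value.
    """
    out = []
    for a, b in rows:
        out.append((a, b, b))        # grouping set (a, b)
        out.append((a, None, b))     # grouping set (a)
        out.append((None, None, b))  # grouping set ()
    return out


def sql_sum(values):
    """SUM with SQL semantics: NULLs are skipped, all-NULL input gives NULL."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def rollup_having(rows, use_expand_copy, threshold=25):
    """Evaluate GROUP BY ROLLUP(a, b) HAVING SUM(b) > threshold.

    use_expand_copy=False aggregates the original b (expected behaviour).
    use_expand_copy=True aggregates the nulled-out grouping copy of b
    (the behaviour assumed to cause the wrong results).
    """
    groups = {}
    for a_key, b_key, b_orig in expand_rollup(rows):
        value = b_key if use_expand_copy else b_orig
        groups.setdefault((a_key, b_key), []).append(value)

    result = []
    for (a_key, b_key), values in groups.items():
        total = sql_sum(values)
        # A NULL sum never satisfies the HAVING predicate.
        if total is not None and total > threshold:
            result.append((a_key, b_key, total))
    return result


expected = rollup_having(ROWS, use_expand_copy=False)
observed = rollup_having(ROWS, use_expand_copy=True)
```

Under these assumptions, `expected` contains the four rows reported for the legacy analyzer and `observed` contains only (2, 30, 30), the single row reported for the single-pass resolver.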
> [Analyzer++] HAVING/ORDER BY aggregates over GROUPING SETS give wrong results
> -----------------------------------------------------------------------------
>
> Key: SPARK-57346
> URL: https://issues.apache.org/jira/browse/SPARK-57346
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Stefan Kandic
> Priority: Major
>
> SUM(b) in HAVING/ORDER BY above a GROUPING SETS/CUBE/ROLLUP aggregate
> resolves grouping column b to the Expand-output copy (NULL for rolled-up
> groups) instead of the original column. HAVING filters out rolled-up rows;
> ORDER BY sorts them as NULL.
> ORDER BY repro:
> SELECT a, b, SUM(b) FROM VALUES (1,10),(1,20),(2,30) AS t(a,b)
> GROUP BY CUBE(a, b) ORDER BY SUM(b);
>
> HAVING repro:
> SELECT a, SUM(b) FROM VALUES (1,10),(1,20),(2,30) AS t(a,b)
> GROUP BY ROLLUP(a, b) HAVING SUM(b) > 25;
--
This message was sent by Atlassian Jira
(v8.20.10#820010)