[ 
https://issues.apache.org/jira/browse/SPARK-57346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087784#comment-18087784
 ] 

Anupam Yadav commented on SPARK-57346:
--------------------------------------

Investigated this with the single-pass resolver (Analyzer++). I found two 
distinct issues:

1. *Crash*: With `spark.sql.analyzer.singlePassResolver.enabled=true`, these 
queries crash with `SparkUnsupportedOperationException: Cannot call the method 
dataType of BaseGroupingSets` before producing any results. I filed SPARK-57353 
and opened a fix (https://github.com/apache/spark/pull/56417) that wires 
GroupingAnalyticsResolver into AggregateResolver so the crash no longer occurs.

2. *Wrong results (this issue)*: After the crash fix, multi-column ROLLUP with 
HAVING still produces wrong results under the single-pass resolver. For example:

{code:sql}
SELECT a, SUM(b) FROM VALUES (1,10),(1,20),(2,30) AS t(a,b)
GROUP BY ROLLUP(a, b) HAVING SUM(b) > 25;
{code}

Rows below are written as [a, b, SUM(b)]:

- Legacy analyzer: 4 rows -- [1,null,30], [2,30,30], [2,null,30], [null,null,60]
- Single-pass resolver (after crash fix): 1 row -- [2,30,30]

The HAVING filter references the wrong aggregate expression after expansion. 
This confirms SPARK-57346 is a real, separate bug that remains after the crash 
is fixed. Could you confirm this matches what you observed? Was your repro on 
the single-pass resolver or the default analyzer?
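
To illustrate why the two row counts differ, here is a minimal pure-Python sketch of the mechanism given in the issue description (SUM(b) reading the Expand-output copy of b, which is NULL for rolled-up groups). It is a model, not Spark code: `expand_rollup`, `sql_sum`, `rollup_having`, and the `use_expand_copy` flag are hypothetical names made up for this example, and grouping IDs are omitted because the sample data contains no NULLs.

```python
ROWS = [(1, 10), (1, 20), (2, 30)]  # t(a, b) from the repro


def expand_rollup(rows):
    """Model Expand for ROLLUP(a, b): one copy of each row per grouping set.

    Each output tuple is (a_key, b_key, original_b). The grouping keys are
    nulled out for rolled-up sets; original_b keeps the input value.
    """
    out = []
    for a, b in rows:
        out.append((a, b, b))        # grouping set (a, b)
        out.append((a, None, b))     # grouping set (a)
        out.append((None, None, b))  # grouping set ()
    return out


def sql_sum(values):
    """SQL SUM semantics: ignore NULLs, return NULL if every input is NULL."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def rollup_having(rows, threshold, use_expand_copy):
    """Return (a, b, SUM(b)) rows of ROLLUP(a, b) HAVING SUM(b) > threshold.

    use_expand_copy=True models the mis-resolution: the aggregate reads the
    nulled-out grouping copy of b instead of the original column.
    """
    groups = {}
    for a_key, b_key, original_b in expand_rollup(rows):
        agg_input = b_key if use_expand_copy else original_b
        groups.setdefault((a_key, b_key), []).append(agg_input)

    result = []
    for (a_key, b_key), inputs in groups.items():
        total = sql_sum(inputs)
        # A NULL sum never satisfies the HAVING predicate.
        if total is not None and total > threshold:
            result.append((a_key, b_key, total))
    return result


expected = rollup_having(ROWS, 25, use_expand_copy=False)
mis_resolved = rollup_having(ROWS, 25, use_expand_copy=True)
print(len(expected), len(mis_resolved))
```

Under this model, the correct resolution keeps the four rows listed for the legacy analyzer, while the mis-resolved variant keeps only [2,30,30], because every rolled-up group sums NULLs and fails the HAVING predicate.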

> [Analyzer++] HAVING/ORDER BY aggregates over GROUPING SETS give wrong results
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-57346
>                 URL: https://issues.apache.org/jira/browse/SPARK-57346
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Stefan Kandic
>            Priority: Major
>
> SUM(b) in HAVING/ORDER BY above a GROUPING SETS/CUBE/ROLLUP aggregate 
> resolves grouping column b to the Expand-output copy (NULL for rolled-up 
> groups) instead of the original column. HAVING filters out rolled-up rows; 
> ORDER BY sorts them as NULL.
> ORDER BY repro:
> SELECT a, b, SUM(b) FROM VALUES (1,10),(1,20),(2,30) AS t(a,b)
> GROUP BY CUBE(a, b) ORDER BY SUM(b);
>
> HAVING repro:
> SELECT a, SUM(b) FROM VALUES (1,10),(1,20),(2,30) AS t(a,b)
> GROUP BY ROLLUP(a, b) HAVING SUM(b) > 25;



