[PR] [SPARK-52007] [SQL] Expression IDs shouldn't be present in grouping expressions when using grouping sets [spark]

via GitHub Mon, 05 May 2025 09:14:00 -0700


mihailoale-db opened a new pull request, #50791:
URL: https://github.com/apache/spark/pull/50791


   ### What changes were proposed in this pull request?
   In this PR I propose that we change `.toString` to `toPrettySQL` when 
constructing grouping expressions in the `ResolveGroupingAnalytics` rule.
   
   ### Why are the changes needed?
   Right now the following query passes (`#x` and `#y` are expression IDs 
generated with every cluster start):
   
   ``select * from values(1,2) group by grouping sets (col1,col2,col1+col2) 
order by `(col1#x + col2#y)` ``
   
   But on the next cluster restart the expression IDs are regenerated and the 
query fails. We need to fix this to disallow such nondeterministic 
behavior.
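
   The effect can be illustrated with a toy model in plain Python. This is a 
sketch, not Spark's implementation: `Session`, `Attr`, `Add`, `debug_string` 
and `pretty_sql` are hypothetical names standing in for a cluster session, 
Catalyst expressions, `.toString` and `toPrettySQL` respectively, and the 
starting IDs 10 and 42 are arbitrary.

```python
import itertools


class Attr:
    """Toy column reference carrying a session-specific expression ID."""

    def __init__(self, name, expr_id):
        self.name = name
        self.expr_id = expr_id

    def debug_string(self):
        # Stand-in for `.toString`: the expression ID leaks into the text.
        return f"{self.name}#{self.expr_id}"

    def pretty_sql(self):
        # Stand-in for `toPrettySQL`: only the user-visible name is kept.
        return self.name


class Add:
    """Toy binary addition over two sub-expressions."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def debug_string(self):
        return f"({self.left.debug_string()} + {self.right.debug_string()})"

    def pretty_sql(self):
        return f"({self.left.pretty_sql()} + {self.right.pretty_sql()})"


class Session:
    """Toy session: hands out expression IDs from an arbitrary start value."""

    def __init__(self, first_id):
        self._ids = itertools.count(first_id)

    def attr(self, name):
        return Attr(name, next(self._ids))


def grouping_expr(session):
    # Models the `col1+col2` entry of the grouping sets in the query above.
    return Add(session.attr("col1"), session.attr("col2"))


# Two sessions model the same query analyzed before and after a restart.
before_restart = grouping_expr(Session(first_id=10))
after_restart = grouping_expr(Session(first_id=42))
```

   In this model a column name built from `debug_string()` differs between 
the two sessions, so an `order by` that refers to it only resolves in the 
session that produced it, while a name built from `pretty_sql()` is the same 
in both.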
   
   ### Does this PR introduce _any_ user-facing change?
   Some queries (and DataFrame programs) will now fail, but they would have 
failed after any cluster restart anyway (as explained above).
   
   ### How was this patch tested?
   Added tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


