
Sakthi (Jira) Wed, 26 Feb 2025 21:42:05 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-38983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930982#comment-17930982
 ]


Sakthi commented on SPARK-38983:
--------------------------------

It's worth noting that the error message issue is fixed in the current main 
(master) branch:


{code:python}
>>> print(spark.version)
4.1.0-SNAPSHOT
>>> from pyspark.sql import functions as f
>>> from pyspark.sql import types as t
>>> l = [
...   ('a',),
...   ('b',),
... ]
>>> s = t.StructType([
...   t.StructField('col1', t.StringType())
... ])
>>> df = spark.createDataFrame(l, s)
>>> df.cube(f.col('col1')).agg(f.grouping('col1') & f.lit(True)).collect()
pyspark.errors.exceptions.captured.AnalysisException: 
[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(grouping(col1) AND 
true)" due to data type mismatch: the left and right operands of the binary 
operator have incompatible types ("TINYINT" and "BOOLEAN"). SQLSTATE: 42K09;
{code}
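A regression check for this fix could compare the two messages mechanically. The following is only a sketch working on the message text quoted above; {{parse_error}} is a hypothetical helper, not a PySpark API, and the regular expressions assume the message layout shown in this session. Newer PySpark versions may also expose these fields on the exception object itself; that is not assumed here. Note that the engine reports the left operand as TINYINT rather than the int the reporter expected.

```python
import re

# Message text copied from the session above, with the line wraps joined.
NEW_MESSAGE = (
    '[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve '
    '"(grouping(col1) AND true)" due to data type mismatch: the left and '
    'right operands of the binary operator have incompatible types '
    '("TINYINT" and "BOOLEAN"). SQLSTATE: 42K09;'
)

# The misleading message reported in the ticket for 3.1.2 and 3.2.1.
OLD_MESSAGE = "grouping() can only be used with GroupingSets/Cube/Rollup;"


def parse_error(message):
    """Hypothetical helper: pull the error class, operand types and SQLSTATE
    out of an AnalysisException message, or None where a field is absent."""
    error_class = re.match(r"\[([A-Z_.]+)\]", message)
    operand_types = re.search(r'\("([A-Z]+)" and "([A-Z]+)"\)', message)
    sqlstate = re.search(r"SQLSTATE: (\w+)", message)
    return {
        "error_class": error_class.group(1) if error_class else None,
        "operand_types": operand_types.groups() if operand_types else None,
        "sqlstate": sqlstate.group(1) if sqlstate else None,
    }


new_parsed = parse_error(NEW_MESSAGE)
old_parsed = parse_error(OLD_MESSAGE)
```

With the fixed message, {{new_parsed}} names the type mismatch and both operand types; the old message yields no error class at all, which is what made it misleading.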
 

 

> Pyspark throws AnalysisException with incorrect error message when using 
> .grouping() or .groupingId() (AnalysisException: grouping() can only be used 
> with GroupingSets/Cube/Rollup;)
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38983
>                 URL: https://issues.apache.org/jira/browse/SPARK-38983
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2, 3.2.1
>         Environment: I have reproduced this error in two environments. I 
> would be happy to answer questions about either.
> h1. Environment 1
> I first encountered this error on my employer's Azure Databricks cluster, 
> which runs Spark version 3.1.2. I have limited access to cluster 
> configuration information, but I can ask if it will help.
> h1. Environment 2
> I reproduced the error by running the same code in the Pyspark shell from 
> Spark 3.2.1 on my Chromebook (i.e. Crostini Linux). I have more access to 
> environment information here. Running {{spark-submit --version}} produced the 
> following output:
> {{Welcome to Spark version 3.2.1}}
> {{Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.14}}
> {{Branch HEAD}}
> {{Compiled by user hgao on 2022-01-20T19:26:14Z}}
> {{Revision 4f25b3f71238a00508a356591553f2dfa89f8290}}
> {{Url https://github.com/apache/spark}}
>            Reporter: Chris Kimmel
>            Priority: Minor
>              Labels: cube, error_message_improvement, exception-handling, 
> grouping, rollup
>
> h1. In a nutshell
> PySpark emits an incorrect error message when the user makes a type error 
> with the result of the {{grouping()}} function.
> h1. Code to reproduce
> {{print(spark.version) # My environment, Azure DataBricks, defines spark 
> automatically.}}
> {{from pyspark.sql import functions as f}}
> {{from pyspark.sql import types as t}}
> {{l = [}}
> {{  ('a',),}}
> {{  ('b',),}}
> {{]}}
> {{s = t.StructType([}}
> {{  t.StructField('col1', t.StringType())}}
> {{])}}
> {{df = spark.createDataFrame(l, s)}}
> {{df.display()}}
> {{( # This expression raises an AnalysisException()}}
> {{  df}}
> {{  .cube(f.col('col1'))}}
> {{  .agg(f.grouping('col1') & f.lit(True))}}
> {{  .collect()}}
> {{)}}
> h1. Expected results
> The code produces an {{AnalysisException()}} with an error message along the 
> lines of:
> {{AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data 
> type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and 
> boolean).;}}
> h1. Actual results
> The code throws an {{AnalysisException()}} with the error message
> {{AnalysisException: grouping() can only be used with 
> GroupingSets/Cube/Rollup;}}
> Python provides the following traceback:
> {{---------------------------------------------------------------------------}}
> {{AnalysisException                         Traceback (most recent call 
> last)}}
> {{<command-2283735107422632> in <module>}}
> {{     15 }}
> {{     16 ( # This expression raises an AnalysisException()}}
> {{---> 17   df}}
> {{     18   .cube(f.col('col1'))}}
> {{     19   .agg(f.grouping('col1') & f.lit(True))}}
> {{/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)}}
> {{    116             # Columns}}
> {{    117             assert all(isinstance(c, Column) for c in exprs), "all 
> exprs should be Column"}}
> {{--> 118             jdf = self._jgd.agg(exprs[0]._jc,}}
> {{    119                                 _to_seq(self.sql_ctx._sc, [c._jc 
> for c in exprs[1:]]))}}
> {{    120         return DataFrame(jdf, self.sql_ctx)}}
> {{/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)}}
> {{   1302 }}
> {{   1303         answer = self.gateway_client.send_command(command)}}
> {{-> 1304         return_value = get_return_value(}}
> {{   1305             answer, self.gateway_client, self.target_id, 
> self.name)}}
> {{   1306 }}
> {{/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)}}
> {{    121                 # Hide where the exception came from that shows a 
> non-Pythonic}}
> {{    122                 # JVM exception message.}}
> {{--> 123                 raise converted from None}}
> {{    124             else:}}
> {{    125                 raise}}
> {{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
> {{'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551]}}
> {{+- LogicalRDD [col1#548], false}}
> h1. Workaround
> _Note:_ The reason I opened this ticket is that, when the user makes a 
> particular type error, the resulting error message is misleading. The code 
> snippet below shows how to fix that type error. It does not address the 
> false-error-message bug, which is the focus of this ticket.
> Cast the result of {{.grouping()}} to boolean type. That is, know _ab ovo_ 
> that {{.grouping()}} produces an integer 0 or 1 rather than a boolean True or 
> False.
> {{(  # This expression does not raise an AnalysisException()}}
> {{  df}}
> {{  .cube(f.col('col1'))}}
> {{  .agg(f.grouping('col1').cast(t.BooleanType()) & f.lit(True))}}
> {{  .collect()}}
> {{)}}
> h1. Additional notes
> The same error occurs if {{.cube()}} is replaced with {{.rollup()}} in "Code 
> to reproduce".
> The same error occurs if {{.grouping()}} is replaced with {{.grouping_id()}} 
> in "Code to reproduce".
> h1. Related tickets
> https://issues.apache.org/jira/browse/SPARK-22748
> h1. Relevant documentation
>  * [Spark SQL GROUPBY, ROLLUP, and CUBE semantics|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html]
>  * [DataFrame.cube()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.cube.html]
>  * [DataFrame.rollup()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.rollup.html]
>  * [DataFrame.agg()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.agg.html]
>  * [functions.grouping()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping.html]
>  * [functions.grouping_id()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping_id.html]
>  
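
The workaround in the quoted description rests on {{grouping()}} being an integer indicator rather than a boolean. As a rough illustration of that semantics, here is a plain-Python model of a cube over the two-row example from the ticket. It is not Spark code: every function name below is hypothetical, and it only mimics the documented behaviour (1 when a column is aggregated away in a grouping set, 0 otherwise; {{grouping_id()}} packs those indicators into a bit vector).

```python
from itertools import combinations


def cube_grouping_sets(cols):
    """All subsets of cols, largest first: the grouping sets a cube expands to."""
    return [set(combo)
            for size in range(len(cols), -1, -1)
            for combo in combinations(cols, size)]


def grouping(col, grouping_set):
    """Indicator as an int: 1 if col is aggregated away, 0 if it is grouped on."""
    return 0 if col in grouping_set else 1


def grouping_id(cols, grouping_set):
    """Bit vector of grouping() values, with the first column in the highest bit."""
    gid = 0
    for col in cols:
        gid = (gid << 1) | grouping(col, grouping_set)
    return gid


def cube_with_grouping(rows, cols):
    """Distinct output keys of the cube, each paired with its grouping_id."""
    out = []
    for grouping_set in cube_grouping_sets(cols):
        # Columns outside the grouping set are nulled out, as in a cube result.
        keys = {tuple(row[c] if c in grouping_set else None for c in cols)
                for row in rows}
        for key in sorted(keys, key=repr):
            out.append((key, grouping_id(cols, grouping_set)))
    return out


rows = [{"col1": "a"}, {"col1": "b"}]
result = cube_with_grouping(rows, ["col1"])

# An explicit conversion to bool before the logical AND plays the role of
# .cast(t.BooleanType()) in the workaround.
flags = [bool(indicator) and True for _, indicator in result]
```

In this model the subtotal row (key None) is the only one whose indicator is 1, so it is the only row for which the boolean expression is true.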



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
