[ https://issues.apache.org/jira/browse/SPARK-38983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930982#comment-17930982 ]
Sakthi commented on SPARK-38983:
--------------------------------

Note that the misleading error message is fixed on the current main (master) branch. The same code now reports the actual type mismatch:
{code:python}
>>> print(spark.version)
4.1.0-SNAPSHOT
>>> from pyspark.sql import functions as f
>>> from pyspark.sql import types as t
>>> l = [
...     ('a',),
...     ('b',),
... ]
>>> s = t.StructType([
...     t.StructField('col1', t.StringType())
... ])
>>> df = spark.createDataFrame(l, s)
>>> df.cube(f.col('col1')).agg(f.grouping('col1') & f.lit(True)).collect()
pyspark.errors.exceptions.captured.AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(grouping(col1) AND true)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("TINYINT" and "BOOLEAN"). SQLSTATE: 42K09;
{code}

> Pyspark throws AnalysisException with incorrect error message when using
> .grouping() or .groupingId() (AnalysisException: grouping() can only be used
> with GroupingSets/Cube/Rollup;)
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-38983
>                 URL: https://issues.apache.org/jira/browse/SPARK-38983
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2, 3.2.1
>         Environment: I have reproduced this error in two environments. I
> would be happy to answer questions about either.
> h1. Environment 1
> I first encountered this error on my employer's Azure Databricks cluster,
> which runs Spark version 3.1.2. I have limited access to cluster
> configuration information, but I can ask for it if that will help.
> h1. Environment 2
> I reproduced the error by running the same code in the PySpark shell from
> Spark 3.2.1 on my Chromebook (i.e. Crostini Linux). I have more access to
> environment information here. Running {{spark-submit --version}} produced the
> following output:
> {{Welcome to Spark version 3.2.1}}
> {{Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.14}}
> {{Branch HEAD}}
> {{Compiled by user hgao on 2022-01-20T19:26:14Z}}
> {{Revision 4f25b3f71238a00508a356591553f2dfa89f8290}}
> {{Url https://github.com/apache/spark}}
>            Reporter: Chris Kimmel
>            Priority: Minor
>              Labels: cube, error_message_improvement, exception-handling,
> grouping, rollup
>
> h1. In a nutshell
> PySpark emits a misleading error message when the user commits a type error
> with the result of the {{grouping()}} function.
> h1. Code to reproduce
> {{print(spark.version) # My environment, Azure Databricks, defines spark automatically.}}
> {{from pyspark.sql import functions as f}}
> {{from pyspark.sql import types as t}}
> {{l = [}}
> {{    ('a',),}}
> {{    ('b',),}}
> {{]}}
> {{s = t.StructType([}}
> {{    t.StructField('col1', t.StringType())}}
> {{])}}
> {{df = spark.createDataFrame(l, s)}}
> {{df.display()}}
> {{( # This expression raises an AnalysisException()}}
> {{    df}}
> {{    .cube(f.col('col1'))}}
> {{    .agg(f.grouping('col1') & f.lit(True))}}
> {{    .collect()}}
> {{)}}
> h1. Expected results
> The code raises an {{AnalysisException()}} with an error message along the
> lines of:
> {{AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data
> type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and
> boolean).;}}
> h1. Actual results
> The code raises an {{AnalysisException()}} with the error message
> {{AnalysisException: grouping() can only be used with
> GroupingSets/Cube/Rollup;}}
> Python provides the following traceback:
> {{---------------------------------------------------------------------------}}
> {{AnalysisException                         Traceback (most recent call last)}}
> {{<command-2283735107422632> in <module>}}
> {{     15 }}
> {{     16 ( # This expression raises an AnalysisException()}}
> {{---> 17     df}}
> {{     18     .cube(f.col('col1'))}}
> {{     19     .agg(f.grouping('col1') & f.lit(True))}}
> {{/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)}}
> {{    116             # Columns}}
> {{    117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"}}
> {{--> 118             jdf = self._jgd.agg(exprs[0]._jc,}}
> {{    119                                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))}}
> {{    120         return DataFrame(jdf, self.sql_ctx)}}
> {{/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)}}
> {{   1302 }}
> {{   1303         answer = self.gateway_client.send_command(command)}}
> {{-> 1304         return_value = get_return_value(}}
> {{   1305             answer, self.gateway_client, self.target_id, self.name)}}
> {{   1306 }}
> {{/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)}}
> {{    121                 # Hide where the exception came from that shows a non-Pythonic}}
> {{    122                 # JVM exception message.}}
> {{--> 123                 raise converted from None}}
> {{    124             else:}}
> {{    125                 raise}}
> {{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
> {{'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551]}}
> {{+- LogicalRDD [col1#548], false}}
> h1. Workaround
> _Note:_ I opened this ticket because a particular type error produces a
> misleading error message. The code snippet below shows how to fix that type
> error. It does not address the misleading error message, which is the focus
> of this ticket.
> Cast the result of {{.grouping()}} to boolean type. {{.grouping()}} produces
> an integer 0 or 1 rather than a boolean True or False, so it cannot be
> combined directly with a boolean column.
> {{( # This expression does not raise an AnalysisException()}}
> {{    df}}
> {{    .cube(f.col('col1'))}}
> {{    .agg(f.grouping('col1').cast(t.BooleanType()) & f.lit(True))}}
> {{    .collect()}}
> {{)}}
> h1. Additional notes
> The same error occurs if {{.cube()}} is replaced with {{.rollup()}} in "Code
> to reproduce".
> The same error occurs if {{.grouping()}} is replaced with {{.grouping_id()}}
> in "Code to reproduce".
> h1. Related tickets
> https://issues.apache.org/jira/browse/SPARK-22748
> h1. Relevant documentation
> * [Spark SQL GROUPBY, ROLLUP, and CUBE semantics|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html]
> * [DataFrame.cube()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.cube.html]
> * [DataFrame.rollup()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.rollup.html]
> * [DataFrame.agg()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.agg.html]
> * [functions.grouping()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping.html]
> * [functions.grouping_id()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping_id.html]