Chris Nasrallah created SPARK-18358:
---------------------------------------
Summary: Multiple Aggregation Using 'countDistinct' and 'first'
result in error
Key: SPARK-18358
URL: https://issues.apache.org/jira/browse/SPARK-18358
Project: Spark
Issue Type: Bug
Environment: Mac OS X 10.9.5
Apache Spark 2.0.1
Hadoop 1.4
Reporter: Chris Nasrallah
Using PySpark, when I perform multiple aggregations on the same groupBy
object with the functions 'first' and 'countDistinct', the call fails with a
Py4JJavaError. The failure appears only when two 'countDistinct' aggregations
are combined with at least one 'first'.
{code:borderStyle=solid}
from pyspark.sql import SparkSession
import pyspark.sql.functions as sfn
# Bind the session to the name `spark`, which the rest of the snippet uses
spark = SparkSession.builder.master('local').getOrCreate()
df = spark.createDataFrame([
    (1, 'a', 'z'),
    (1, 'b', 'x'),
    (1, 'a', 'y'),
    (1, 'a', 'x'),
    (2, 'b', 'z'),
    (2, 'b', 'z')
], ['id', 'var1', 'var2'])
## Using two 'first' and one 'countDistinct' aggregations works
df.groupby('id') \
    .agg(sfn.first('var1'),
         sfn.first('var2'),
         sfn.countDistinct('var1')).show()
## Using one 'max' with both 'countDistinct' works:
df.groupby('id') \
    .agg(sfn.max('var2'),
         sfn.countDistinct('var1'),
         sfn.countDistinct('var2')).show()
## But using both 'countDistinct' with at least one 'first' crashes
df.groupby('id') \
    .agg(sfn.first('var1'),
         sfn.first('var2'),
         sfn.countDistinct('var1'),
         sfn.countDistinct('var2')) \
    .show()
{code}
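For reference, the values the failing query would be expected to produce can be worked out without Spark. The plain-Python sketch below is only an illustration over the same six rows. The helper name `expected_aggregates` is hypothetical, and it assumes 'first' means the first row in input order, which Spark itself does not guarantee after a shuffle.

```python
from collections import OrderedDict

# Same rows as the DataFrame in the report: (id, var1, var2)
rows = [
    (1, 'a', 'z'),
    (1, 'b', 'x'),
    (1, 'a', 'y'),
    (1, 'a', 'x'),
    (2, 'b', 'z'),
    (2, 'b', 'z'),
]


def expected_aggregates(rows):
    """Hypothetical helper: per id, return
    (first var1, first var2, distinct var1 count, distinct var2 count).
    Assumes 'first' is the first value seen in input order."""
    groups = OrderedDict()
    for id_, var1, var2 in rows:
        if id_ not in groups:
            # The first row seen for this id supplies the 'first' values
            groups[id_] = {'first1': var1, 'first2': var2,
                           'seen1': set(), 'seen2': set()}
        groups[id_]['seen1'].add(var1)
        groups[id_]['seen2'].add(var2)
    return {
        id_: (g['first1'], g['first2'], len(g['seen1']), len(g['seen2']))
        for id_, g in groups.items()
    }


expected = expected_aggregates(rows)
for id_, values in expected.items():
    print(id_, values)
```

Under that ordering assumption, this gives id 1 the values ('a', 'z', 2, 3) and id 2 the values ('b', 'z', 1, 1), which is what the third query should return instead of raising an error.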