Ohh great! Thanks for the clarification.

On Wed, Oct 28, 2015 at 4:21 PM, Reynold Xin <r...@databricks.com> wrote:

> No those are just functions for the DataFrame programming API.
>
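
A minimal sketch of the distinction, assuming a DataFrame df with the
(name, age) schema from the first message, registered as a temporary table
named `table`:

    import org.apache.spark.sql.functions._

    // DataFrame API: mean, countDistinct, etc. are Scala functions that
    // build Column expressions directly, without going through the SQL
    // FunctionRegistry.
    df.agg(mean("age"), countDistinct("age")).show()

    // SQL: function names are resolved through FunctionRegistry, so the
    // registered spellings must be used instead.
    sqlContext.sql("SELECT avg(age), count(DISTINCT age) FROM `table`").show()
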
> On Wed, Oct 28, 2015 at 11:49 AM, Shagun Sodhani <sshagunsodh...@gmail.com>
> wrote:
>
>> @Reynold I seem to be missing something. Aren't the functions listed here
>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$>
>> to be treated as SQL operators as well? I do see that these are mentioned
>> as functions available for DataFrame
>> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html>
>> but it would be great if you could clarify this.
>>
>> On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> I don't think these are bugs. The SQL standard for average is "avg", not
>>> "mean". Similarly, a distinct count is supposed to be written as
>>> "count(distinct col)", not "countDistinct(col)".
>>>
>>> We can, however, make "mean" an alias for "avg" to improve compatibility
>>> between DataFrame and SQL.
>>>
>>>
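
Concretely, and assuming the (name, age) example table from the first
message, a sketch of what does and does not resolve in SQL in 1.5.1:

    sqlContext.sql("SELECT avg(age) FROM `table`").show()             // works: standard spelling
    sqlContext.sql("SELECT count(DISTINCT age) FROM `table`").show()  // works
    sqlContext.sql("SELECT mean(age) FROM `table`").show()            // AnalysisException: undefined function mean
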
>>> On Wed, Oct 28, 2015 at 11:38 AM, Shagun Sodhani <
>>> sshagunsodh...@gmail.com> wrote:
>>>
>>>> Also, are the other aggregate functions to be treated as bugs or not?
>>>>
>>>> On Wed, Oct 28, 2015 at 4:08 PM, Shagun Sodhani <
>>>> sshagunsodh...@gmail.com> wrote:
>>>>
>>>>> Wouldn't it be:
>>>>>
>>>>> +    expression[Max]("avg"),
>>>>>
>>>>> On Wed, Oct 28, 2015 at 4:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>>> Since there is already Average, the simplest change is the following:
>>>>>>
>>>>>> $ git diff sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>> diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>> index 3dce6c1..920f95b 100644
>>>>>> --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>> +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>> @@ -184,6 +184,7 @@ object FunctionRegistry {
>>>>>>      expression[Last]("last"),
>>>>>>      expression[Last]("last_value"),
>>>>>>      expression[Max]("max"),
>>>>>> +    expression[Average]("mean"),
>>>>>>      expression[Min]("min"),
>>>>>>      expression[Stddev]("stddev"),
>>>>>>      expression[StddevPop]("stddev_pop"),
>>>>>>
>>>>>> FYI
>>>>>>
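
Note that expression[Average]("mean") registers the existing Average
expression under an additional name, which is exactly the mean-as-an-alias-
for-avg idea suggested above. After rebuilding with that change, a query
like the following should resolve (a sketch against the thread's example
table, not a tested result):

    sqlContext.sql("SELECT mean(age) FROM `table`").show()
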
>>>>>> On Wed, Oct 28, 2015 at 2:07 AM, Shagun Sodhani <
>>>>>> sshagunsodh...@gmail.com> wrote:
>>>>>>
>>>>>>> I tried adding the aggregate functions in the registry and they
>>>>>>> work, other than mean, for which Ted has forwarded some code changes. I
>>>>>>> will try out those changes and update the status here.
>>>>>>>
>>>>>>> On Wed, Oct 28, 2015 at 9:03 AM, Shagun Sodhani <
>>>>>>> sshagunsodh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yup, avg works fine. So we have alternate functions to use in place
>>>>>>>> of the functions pointed out earlier. But my question is: are those
>>>>>>>> original aggregate functions not supposed to be used, am I using them
>>>>>>>> in the wrong way, or is it a bug, as I asked in my first mail?
>>>>>>>>
>>>>>>>> On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu <yuzhih...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Have you tried using avg in place of mean?
>>>>>>>>>
>>>>>>>>> (1 to 5).foreach { i =>
>>>>>>>>>   (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
>>>>>>>>> }
>>>>>>>>> sqlContext.sql("""
>>>>>>>>>   CREATE TEMPORARY TABLE partitionedParquet
>>>>>>>>>   USING org.apache.spark.sql.parquet
>>>>>>>>>   OPTIONS (
>>>>>>>>>     path '/tmp/partitioned'
>>>>>>>>>   )""")
>>>>>>>>> sqlContext.sql("""select avg(a) from partitionedParquet""").show()
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani <
>>>>>>>>> sshagunsodh...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> So I tried @Reynold's suggestion. I could get countDistinct and
>>>>>>>>>> sumDistinct running, but mean and approxCountDistinct do not work.
>>>>>>>>>> (I guess I am using the wrong syntax for approxCountDistinct.) For
>>>>>>>>>> mean, I think the registry entry is missing. Can someone clarify
>>>>>>>>>> that as well?
>>>>>>>>>>
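
On the approxCountDistinct syntax: the DataFrame-API name again will not
resolve in SQL. Whether the registry in a given build exposes a SQL name
for it is worth checking in FunctionRegistry.scala; assuming it is
registered as approx_count_distinct (the name used in later Spark code),
the SQL form would be:

    sqlContext.sql("SELECT approx_count_distinct(age) FROM `table`").show()
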
>>>>>>>>>> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani <
>>>>>>>>>> sshagunsodh...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Will try in a while when I get back. I assume this applies to
>>>>>>>>>>> all functions other than mean. Also, countDistinct is defined
>>>>>>>>>>> along with all the other SQL functions, so I don't get the
>>>>>>>>>>> "distinct is not part of the function name" part.
>>>>>>>>>>> On 27 Oct 2015 19:58, "Reynold Xin" <r...@databricks.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Try
>>>>>>>>>>>>
>>>>>>>>>>>> count(distinct columnName)
>>>>>>>>>>>>
>>>>>>>>>>>> In SQL, distinct is not part of the function name.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tuesday, October 27, 2015, Shagun Sodhani <
>>>>>>>>>>>> sshagunsodh...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Oops, it seems I made a mistake. The error message is: Exception
>>>>>>>>>>>>> in thread "main" org.apache.spark.sql.AnalysisException:
>>>>>>>>>>>>> undefined function countDistinct
>>>>>>>>>>>>> On 27 Oct 2015 15:49, "Shagun Sodhani" <
>>>>>>>>>>>>> sshagunsodh...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi! I was trying out some aggregate functions in SparkSql,
>>>>>>>>>>>>>> and I noticed that certain aggregate operators are not
>>>>>>>>>>>>>> working. These include:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> approxCountDistinct
>>>>>>>>>>>>>> countDistinct
>>>>>>>>>>>>>> mean
>>>>>>>>>>>>>> sumDistinct
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For example, using countDistinct results in an error saying
>>>>>>>>>>>>>> *Exception in thread "main"
>>>>>>>>>>>>>> org.apache.spark.sql.AnalysisException: undefined function cosh;*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I had a similar issue with the cosh operator
>>>>>>>>>>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Exception-when-using-cosh-td14724.html>
>>>>>>>>>>>>>> some time back, and it turned out that it was not registered in
>>>>>>>>>>>>>> the registry:
>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>>>>>>>>>>
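
A runnable sketch of the report, assuming a spark-shell 1.5.1 session
(sc and sqlContext in scope); the data here is illustrative, mirroring the
query given later in this message:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.countDistinct

    // Build a small (name, age) DataFrame and register it as `table`.
    val df = sc.parallelize(Seq(("alice", 30.0), ("bob", 30.0))).toDF("name", "age")
    df.registerTempTable("table")

    // Fails with AnalysisException: undefined function countDistinct.
    sqlContext.sql("SELECT countDistinct(`age`) as `data` FROM `table`").show()

    // The same aggregate through the DataFrame API works.
    df.agg(countDistinct("age").as("data")).show()
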
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *I think it is the same issue again and would be glad to send
>>>>>>>>>>>>>> over a PR if someone can confirm that this is an actual bug and
>>>>>>>>>>>>>> not some mistake on my part.*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Query I am using: SELECT countDistinct(`age`) as `data` FROM
>>>>>>>>>>>>>> `table`
>>>>>>>>>>>>>> Spark Version: 10.4
>>>>>>>>>>>>>> SparkSql Version: 1.5.1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am using the standard example of (name, age) schema (though
>>>>>>>>>>>>>> I am setting age as Double and not Int, as I am trying out
>>>>>>>>>>>>>> maths functions).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The entire error stack can be found here
>>>>>>>>>>>>>> <http://pastebin.com/G6YzQXnn>.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
