I modified DataFrameSuite, on the master branch, to call percentile_approx instead of simpleUDF:
- deprecated callUdf in SQLContext
- callUDF in SQLContext *** FAILED ***
  org.apache.spark.sql.AnalysisException: undefined function percentile_approx;
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:63)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
  at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)

SPARK-10671 is included. For 1.5.1, I guess the absence of SPARK-10671 means
that Spark SQL treats percentile_approx as a normal UDF. Experts can correct
me if there is any misunderstanding.

Cheers

On Tue, Oct 13, 2015 at 6:09 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi Ted, I am using the following line of code. I can't paste the entire
> code, sorry, but this one line does not compile in my Spark job:
>
> sourceframe.select(callUDF("percentile_approx", col("mycol"), lit(0.25)))
>
> I am using the IntelliJ editor, Java, and Maven dependencies for
> spark-core, spark-sql and spark-hive, version 1.5.1.
>
> On Oct 13, 2015 18:21, "Ted Yu" <yuzhih...@gmail.com> wrote:
>
>> Can you pastebin your Java code and the command you used to compile?
>>
>> Thanks
>>
>> On Oct 13, 2015, at 1:42 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>
>> Hi Ted, if the fix went in after the 1.5.1 release, then how come it
>> works with the 1.5.1 binary in spark-shell?
>>
>> On Oct 13, 2015 1:32 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>
>>> Looks like the fix went in after 1.5.1 was released.
>>>
>>> You may verify using a master branch build.
>>>
>>> Cheers
>>>
>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>>
>>> Hi Ted, thanks much. I tried using percentile_approx in spark-shell like
>>> you mentioned and it works on 1.5.1, but it doesn't compile in Java with
>>> the 1.5.1 Maven libraries; it still complains that callUdf can take only
>>> String and Column types. Please guide.
>>>
>>> On Oct 13, 2015 12:34 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>>
>>>> SQL context available as sqlContext.
>>>>
>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>> "value")
>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>
>>>> scala> df.select(callUDF("percentile_approx", col("value"),
>>>> lit(0.25))).show()
>>>> +------------------------------+
>>>> |'percentile_approx(value,0.25)|
>>>> +------------------------------+
>>>> |                           1.0|
>>>> +------------------------------+
>>>>
>>>> Can you upgrade to 1.5.1?
>>>>
>>>> Cheers
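For readers who must stay on 1.5.1, one route that sidesteps callUDF entirely is to issue the query through HiveContext SQL, which resolves Hive UDAFs such as percentile_approx on its own. A minimal sketch, assuming a spark-shell session where sc is predefined; the table and column names here are illustrative, not taken from the thread:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
df.registerTempTable("myTable")

// percentile_approx is resolved by Hive's own function registry here, not
// by the DataFrame function registry that raised "undefined function" above.
hiveContext.sql("select percentile_approx(value, 0.25) from myTable").show()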
>>>>
>>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <umesh.ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> Sorry, forgot to mention that I am using Spark 1.4.1, since callUdf
>>>>> is available as of Spark 1.4.0 per the Javadocs.
>>>>>
>>>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <umesh.ka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ted, thanks much for the detailed answer; I appreciate your
>>>>>> efforts. Do we need to register Hive UDFs?
>>>>>>
>>>>>> sqlContext.udf.register("percentile_approx"); // is this valid?
>>>>>>
>>>>>> I am calling the Hive UDF percentile_approx in the following manner,
>>>>>> which gives a compilation error:
>>>>>>
>>>>>> df.select("col1").groupBy("col1").agg(callUdf("percentile_approx",
>>>>>> col("col1"), lit(0.25))); // compile error
>>>>>>
>>>>>> // compile error because callUdf() takes String and Column* as
>>>>>> arguments.
>>>>>>
>>>>>> Please guide. Thanks much.
>>>>>>
>>>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> Using spark-shell, I did the following exercise (master branch):
>>>>>>>
>>>>>>> SQL context available as sqlContext.
>>>>>>>
>>>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>>>> "value")
>>>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>>>
>>>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) =>
>>>>>>> v * v + cnst)
>>>>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>>>>
>>>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>>>>>>> lit(25))).show()
>>>>>>> +---+--------------------+
>>>>>>> | id|'simpleUDF(value,25)|
>>>>>>> +---+--------------------+
>>>>>>> |id1|                  26|
>>>>>>> |id2|                  41|
>>>>>>> |id3|                  50|
>>>>>>> +---+--------------------+
>>>>>>>
>>>>>>> Which Spark release are you using?
>>>>>>>
>>>>>>> Can you pastebin the full stack trace where you got the error?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <umesh.ka...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have a doubt, Michael: I tried to use callUDF in the following
>>>>>>>> code and it does not work.
>>>>>>>>
>>>>>>>> sourceFrame.agg(callUdf("percentile_approx", col("myCol"), lit(0.25)))
>>>>>>>>
>>>>>>>> The above code does not compile because callUdf() takes only two
>>>>>>>> arguments: the function name as a String and a Column. Please guide.
>>>>>>>>
>>>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks much, Michael; let me try.
>>>>>>>>>
>>>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>>>>> mich...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is confusing because I made a typo...
>>>>>>>>>>
>>>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>>>
>>>>>>>>>> The first argument is the name of the UDF; all other arguments
>>>>>>>>>> need to be columns that are passed in as arguments. lit is just
>>>>>>>>>> saying to make a literal column that always has the value 0.25.
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <saif.a.ell...@wellsfargo.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, but I mean, this is rather curious. How does def
>>>>>>>>>>> lit(literal: Any) become the percentile argument in lit(25)?
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the clarification
>>>>>>>>>>>
>>>>>>>>>>> Saif
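On the lit question above: lit is not percentile-specific at all; it merely wraps a constant in a Column so that a plain value can be passed where the API expects columns. A small sketch with illustrative data, assuming a spark-shell session:

import org.apache.spark.sql.functions.{col, lit}

val df = Seq(("id1", 1), ("id2", 4)).toDF("id", "value")

// lit(0.25) is a Column that holds the constant 0.25 for every row;
// percentile_approx simply receives its quantile through such a column.
df.select(col("id"), col("value"), lit(0.25).as("quantile")).show()

In other words, lit(25) in the simpleUDF example and lit(0.25) here play the same role: a constant column, nothing more.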
>>>>>>>>>>>
>>>>>>>>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>>> DataFrame?
>>>>>>>>>>>
>>>>>>>>>>> I found it in the 1.3 documentation; lit says something else,
>>>>>>>>>>> not percentile:
>>>>>>>>>>>
>>>>>>>>>>> public static Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>>>>> lit(Object literal)
>>>>>>>>>>>
>>>>>>>>>>> Creates a Column of literal value.
>>>>>>>>>>>
>>>>>>>>>>> The passed in object is returned directly if it is already a
>>>>>>>>>>> Column. If the object is a Scala Symbol, it is converted into a
>>>>>>>>>>> Column also. Otherwise, a new Column is created to represent
>>>>>>>>>>> the literal value.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <saif.a.ell...@wellsfargo.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Where can we find other available functions such as lit()? I
>>>>>>>>>>> can't find lit in the API.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>>>>> *To:* unk1102
>>>>>>>>>>> *Cc:* user
>>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>>> DataFrame?
>>>>>>>>>>>
>>>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call Hive UDFs
>>>>>>>>>>> from DataFrames.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <umesh.ka...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi, how do I calculate the percentile of a column in a
>>>>>>>>>>> DataFrame? I can't find any percentile_approx function among
>>>>>>>>>>> the Spark aggregation functions. For example, in Hive we have
>>>>>>>>>>> percentile_approx and we can use it in the following way:
>>>>>>>>>>>
>>>>>>>>>>> hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable");
>>>>>>>>>>>
>>>>>>>>>>> I can see the ntile function, but I am not sure how it would
>>>>>>>>>>> give the same results as the above query. Please guide.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> View this message in context:
>>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>>>> Nabble.com.
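Pulling the thread together: the grouped aggregation the original question is after can be sketched as below on a 1.5.x spark-shell started with Hive support (so that sqlContext is a HiveContext). The data and column names are illustrative, not taken from the thread:

import org.apache.spark.sql.functions.{callUDF, col, lit}

val df = Seq(("a", 1), ("a", 4), ("b", 5)).toDF("col1", "value")

// callUDF takes the UDF name plus Column arguments; lit(0.25) supplies
// the quantile as a literal Column, per Michael's correction above.
df.groupBy(col("col1"))
  .agg(callUDF("percentile_approx", col("value"), lit(0.25)))
  .show()

Whether the same call compiles from Java against the 1.5.1 Maven artifacts is the question the thread leaves unresolved above.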