I modified DataFrameSuite, on the master branch, to call percentile_approx instead of simpleUDF:
- deprecated callUdf in SQLContext
- callUDF in SQLContext *** FAILED ***
  org.apache.spark.sql.AnalysisException: undefined function percentile_approx;
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:63)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
  at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)

SPARK-10671 is included. For 1.5.1, I guess the absence of SPARK-10671 means
that Spark SQL treats percentile_approx as a normal UDF. Experts can correct
me if there is any misunderstanding.

Cheers

On Tue, Oct 13, 2015 at 6:09 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi Ted, I am using the following line of code. I can't paste the entire
> code, sorry, but this one line does not compile in my Spark job:
>
> sourceframe.select(callUDF("percentile_approx", col("mycol"), lit(0.25)))
>
> I am using the IntelliJ editor, Java, and Maven dependencies for
> spark-core, spark-sql and spark-hive, version 1.5.1.
>
> On Oct 13, 2015 18:21, "Ted Yu" <yuzhih...@gmail.com> wrote:
>
>> Can you pastebin your Java code and the command you used to compile?
>>
>> Thanks
>>
>> On Oct 13, 2015, at 1:42 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>
>> Hi Ted, if the fix went in after the 1.5.1 release, then how come it
>> works with the 1.5.1 binary in spark-shell?
>>
>> On Oct 13, 2015 1:32 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>
>>> Looks like the fix went in after 1.5.1 was released.
>>>
>>> You may verify using a master branch build.
>>>
>>> Cheers
>>>
>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>>
>>> Hi Ted, thanks much. I tried using percentile_approx in spark-shell like
>>> you mentioned and it works on 1.5.1, but it doesn't compile in Java with
>>> the 1.5.1 Maven libraries; it still complains that callUdf can take only
>>> String and Column types. Please guide.
>>>
>>> On Oct 13, 2015 12:34 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>>
>>>> SQL context available as sqlContext.
>>>>
>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>> "value")
>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>
>>>> scala> df.select(callUDF("percentile_approx", col("value"),
>>>> lit(0.25))).show()
>>>> +------------------------------+
>>>> |'percentile_approx(value,0.25)|
>>>> +------------------------------+
>>>> |                           1.0|
>>>> +------------------------------+
>>>>
>>>> Can you upgrade to 1.5.1?
>>>>
>>>> Cheers
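For readers who must stay on 1.5.1, one route that sidesteps callUDF entirely is to issue the query through HiveContext SQL, which resolves Hive UDAFs such as percentile_approx on its own. A minimal sketch, assuming a spark-shell session where sc is predefined; the table and column names here are illustrative, not taken from the thread:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
df.registerTempTable("myTable")

// percentile_approx is resolved by Hive's own function registry here, not
// by the DataFrame function registry that raised "undefined function" above.
hiveContext.sql("select percentile_approx(value, 0.25) from myTable").show()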
>>>>
>>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <umesh.ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> Sorry, forgot to mention that I am using Spark 1.4.1, since callUdf
>>>>> is available as of Spark 1.4.0 per the Javadocs.
>>>>>
>>>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <umesh.ka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ted, thanks much for the detailed answer; I appreciate your
>>>>>> efforts. Do we need to register Hive UDFs?
>>>>>>
>>>>>> sqlContext.udf.register("percentile_approx"); // is this valid?
>>>>>>
>>>>>> I am calling the Hive UDF percentile_approx in the following manner,
>>>>>> which gives a compilation error:
>>>>>>
>>>>>> df.select("col1").groupBy("col1").agg(callUdf("percentile_approx",
>>>>>> col("col1"), lit(0.25))); // compile error
>>>>>>
>>>>>> // compile error because callUdf() takes String and Column* as
>>>>>> arguments.
>>>>>>
>>>>>> Please guide. Thanks much.
>>>>>>
>>>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> Using spark-shell, I did the following exercise (master branch):
>>>>>>>
>>>>>>> SQL context available as sqlContext.
>>>>>>>
>>>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>>>> "value")
>>>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>>>
>>>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) =>
>>>>>>> v * v + cnst)
>>>>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>>>>
>>>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>>>>>>> lit(25))).show()
>>>>>>> +---+--------------------+
>>>>>>> | id|'simpleUDF(value,25)|
>>>>>>> +---+--------------------+
>>>>>>> |id1|                  26|
>>>>>>> |id2|                  41|
>>>>>>> |id3|                  50|
>>>>>>> +---+--------------------+
>>>>>>>
>>>>>>> Which Spark release are you using?
>>>>>>>
>>>>>>> Can you pastebin the full stack trace where you got the error?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <umesh.ka...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have a doubt, Michael: I tried to use callUDF in the following
>>>>>>>> code and it does not work.
>>>>>>>>
>>>>>>>> sourceFrame.agg(callUdf("percentile_approx", col("myCol"), lit(0.25)))
>>>>>>>>
>>>>>>>> The above code does not compile because callUdf() takes only two
>>>>>>>> arguments: the function name as a String and a Column. Please guide.
>>>>>>>>
>>>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks much, Michael; let me try.
>>>>>>>>>
>>>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>>>>> mich...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is confusing because I made a typo...
>>>>>>>>>>
>>>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>>>
>>>>>>>>>> The first argument is the name of the UDF; all other arguments
>>>>>>>>>> need to be columns that are passed in as arguments. lit is just
>>>>>>>>>> saying to make a literal column that always has the value 0.25.
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <saif.a.ell...@wellsfargo.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, but I mean, this is rather curious. How does def
>>>>>>>>>>> lit(literal: Any) become the percentile argument in lit(25)?
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the clarification
>>>>>>>>>>>
>>>>>>>>>>> Saif
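On the lit question above: lit is not percentile-specific at all; it merely wraps a constant in a Column so that a plain value can be passed where the API expects columns. A small sketch with illustrative data, assuming a spark-shell session:

import org.apache.spark.sql.functions.{col, lit}

val df = Seq(("id1", 1), ("id2", 4)).toDF("id", "value")

// lit(0.25) is a Column that holds the constant 0.25 for every row;
// percentile_approx simply receives its quantile through such a column.
df.select(col("id"), col("value"), lit(0.25).as("quantile")).show()

In other words, lit(25) in the simpleUDF example and lit(0.25) here play the same role: a constant column, nothing more.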
>>>>>>>>>>>
>>>>>>>>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>>> DataFrame?
>>>>>>>>>>>
>>>>>>>>>>> I found it in the 1.3 documentation; lit says something else,
>>>>>>>>>>> not percentile:
>>>>>>>>>>>
>>>>>>>>>>> public static Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>>>>> lit(Object literal)
>>>>>>>>>>>
>>>>>>>>>>> Creates a Column of literal value.
>>>>>>>>>>>
>>>>>>>>>>> The passed in object is returned directly if it is already a
>>>>>>>>>>> Column. If the object is a Scala Symbol, it is converted into a
>>>>>>>>>>> Column also. Otherwise, a new Column is created to represent
>>>>>>>>>>> the literal value.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <saif.a.ell...@wellsfargo.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Where can we find other available functions such as lit()? I
>>>>>>>>>>> can't find lit in the API.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>>>>> *To:* unk1102
>>>>>>>>>>> *Cc:* user
>>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>>> DataFrame?
>>>>>>>>>>>
>>>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call Hive UDFs
>>>>>>>>>>> from DataFrames.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <umesh.ka...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi, how do I calculate the percentile of a column in a
>>>>>>>>>>> DataFrame? I can't find any percentile_approx function among
>>>>>>>>>>> the Spark aggregation functions. For example, in Hive we have
>>>>>>>>>>> percentile_approx and we can use it in the following way:
>>>>>>>>>>>
>>>>>>>>>>> hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable");
>>>>>>>>>>>
>>>>>>>>>>> I can see the ntile function, but I am not sure how it would
>>>>>>>>>>> give the same results as the above query. Please guide.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> View this message in context:
>>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>>>> Nabble.com.
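Pulling the thread together: the grouped aggregation the original question is after can be sketched as below on a 1.5.x spark-shell started with Hive support (so that sqlContext is a HiveContext). The data and column names are illustrative, not taken from the thread:

import org.apache.spark.sql.functions.{callUDF, col, lit}

val df = Seq(("a", 1), ("a", 4), ("b", 5)).toDF("col1", "value")

// callUDF takes the UDF name plus Column arguments; lit(0.25) supplies
// the quantile as a literal Column, per Michael's correction above.
df.groupBy(col("col1"))
  .agg(callUDF("percentile_approx", col("value"), lit(0.25)))
  .show()

Whether the same call compiles from Java against the 1.5.1 Maven artifacts is the question the thread leaves unresolved above.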