Re: How to calculate percentile of a column of DataFrame?

Umesh Kacha Tue, 13 Oct 2015 00:22:41 -0700

Hi Ted, thanks much. I tried percentile_approx in spark-shell as you
suggested and it works with 1.5.1, but the same call doesn't compile in Java
against the 1.5.1 Maven libraries. The compiler still complains that callUdf
accepts only String and Column types. Please guide.
On Oct 13, 2015 12:34 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:


> SQL context available as sqlContext.
>
> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>
> scala> df.select(callUDF("percentile_approx", col("value"), lit(0.25))).show()
> +------------------------------+
> |'percentile_approx(value,0.25)|
> +------------------------------+
> |                           1.0|
> +------------------------------+
>
> Can you upgrade to 1.5.1 ?
>
> Cheers
>
> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <umesh.ka...@gmail.com>
> wrote:
>
>> Sorry, I forgot to mention that I am using Spark 1.4.1, since callUdf is
>> available from Spark 1.4.0 as per the Javadoc.
>>
>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <umesh.ka...@gmail.com>
>> wrote:
>>
>>> Hi Ted, thanks much for the detailed answer. I appreciate your efforts.
>>> Do we need to register Hive UDFs?
>>>
>>> sqlContext.udf.register("percentile_approx"); // is this valid?
>>>
>>> I am calling the Hive UDF percentile_approx in the following manner, which
>>> gives a compilation error:
>>>
>>> df.select("col1").groupBy("col1").agg(callUdf("percentile_approx", col("col1"), lit(0.25)));
>>>
>>> // compile error because callUdf() takes String and Column* as arguments.
>>>
>>> Please guide. Thanks much.
>>>
>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>
>>>>
>>>> SQL context available as sqlContext.
>>>>
>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>
>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v + cnst)
>>>> res0: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List())
>>>>
>>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>>> +---+--------------------+
>>>> | id|'simpleUDF(value,25)|
>>>> +---+--------------------+
>>>> |id1|                  26|
>>>> |id2|                  41|
>>>> |id3|                  50|
>>>> +---+--------------------+
>>>>
>>>> Which Spark release are you using ?
>>>>
>>>> Can you pastebin the full stack trace where you got the error ?
>>>>
>>>> Cheers
>>>>
>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <umesh.ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> I have a question, Michael. I tried to use callUDF in the following
>>>>> code, but it does not work:
>>>>>
>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>
>>>>> The code above does not compile because callUdf() takes only two
>>>>> arguments: the function name as a String and a Column. Please guide.
>>>>>
>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks much, Michael. Let me try.
>>>>>>
>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>> mich...@databricks.com> wrote:
>>>>>>
>>>>>>> This is confusing because I made a typo...
>>>>>>>
>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>
>>>>>>> The first argument is the name of the UDF; all other arguments need
>>>>>>> to be columns that are passed in as arguments. lit is just saying to
>>>>>>> make a literal column that always has the value 0.25.
>>>>>>>
>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <saif.a.ell...@wellsfargo.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes, but I mean, this is rather curious. How does def lit(literal: Any)
>>>>>>>> become a percentile function in lit(25)?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for clarification
>>>>>>>>
>>>>>>>> Saif
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>
>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>> DataFrame?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I found it in the 1.3 documentation. lit is described as something
>>>>>>>> else, not a percentile:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> public static Column lit(Object literal)
>>>>>>>>
>>>>>>>> Creates a Column of literal value.
>>>>>>>>
>>>>>>>> The passed in object is returned directly if it is already a Column.
>>>>>>>> If the object is a Scala Symbol, it is converted into a Column also.
>>>>>>>> Otherwise, a new Column is created to represent the literal value.
>>>>>>>>
>>>>>>>> Column:
>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <saif.a.ell...@wellsfargo.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Where can we find other available functions such as lit() ? I can’t
>>>>>>>> find lit in the api.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>> *To:* unk1102
>>>>>>>> *Cc:* user
>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>> DataFrame?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs
>>>>>>>> from dataframes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <umesh.ka...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi, how do I calculate the percentile of a column in a DataFrame? I
>>>>>>>> can't find any percentile_approx function among the Spark aggregation
>>>>>>>> functions. For example, in Hive we have percentile_approx and can use
>>>>>>>> it in the following way:
>>>>>>>>
>>>>>>>> hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable");
>>>>>>>>
>>>>>>>> I can see the ntile function, but I am not sure how it would give the
>>>>>>>> same results as the above query. Please guide.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
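
As a footnote to Saif's question about lit(0.25): the 0.25 is only the
fraction handed to percentile_approx as its second argument, so the call asks
for the 25th percentile of the column. The sketch below is plain Java with no
Spark dependency. PercentileSketch and nearestRank are hypothetical names made
up for illustration, and the exact nearest-rank rule is an assumption chosen
for simplicity. It is not Hive's approximation algorithm, so results on larger
data may differ from percentile_approx.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Spark or Hive.
class PercentileSketch {

    // Exact nearest-rank percentile: sort the values and pick the element
    // at 1-based rank ceil(p * n), clamped so that p = 0 returns the minimum.
    static double nearestRank(double[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in [0, 1]");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Same three values as the "value" column in Ted's spark-shell session.
        double[] values = {1, 4, 5};
        System.out.println(nearestRank(values, 0.25));
    }
}
```

On the three sample values from Ted's spark-shell session (1, 4, 5), this rule
gives 1.0 for a fraction of 0.25, which happens to agree with the
percentile_approx output shown earlier in the thread.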
