I don't know which one is 'correct' (STDDEV isn't standard SQL, is it?), or
whether it maps to the sample stdev for a good reason or just for historical
ones. But you can always call STDDEV_SAMP explicitly (in any DB) if needed.
It's equivalent to numpy.std with ddof=1, i.e. the Bessel-corrected sample
standard deviation.
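
A quick toy check of that equivalence (the numbers are illustrative):

    import numpy as np

    xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(np.std(xs))          # ddof=0 (default): population stdev, like STDDEV_POP -> 2.0
    print(np.std(xs, ddof=1))  # Bessel-corrected sample stdev, like STDDEV_SAMP -> ~2.138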

On Thu, Dec 24, 2020 at 3:17 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

>
> Well, the truth is that we had this discussion back in 2016 :(. What Hive
> calls the standard deviation function, STDDEV, is an alias for STDDEV_POP.
> This is incorrect and has not been rectified yet!
>
>
> Spark SQL, Oracle and Sybase point STDDEV to STDDEV_SAMP, not STDDEV_POP.
> Run a test on *Hive*:
>
>
> SELECT
>     SQRT((SUM(POWER(amount_sold,2)) - (COUNT(1)*POWER(AVG(amount_sold),2))) / (COUNT(1)-1)) AS mystddev,
>     STDDEV(amount_sold)      AS stddev,
>     STDDEV_SAMP(amount_sold) AS stddev_samp,
>     STDDEV_POP(amount_sold)  AS stddev_pop
> FROM sales;
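>
> (For reference, that hand-rolled MYSTDDEV is just the textbook shortcut form
> of the sample standard deviation: s = sqrt((SUM(x^2) - n*AVG(x)^2) / (n - 1)),
> with n = COUNT(1).)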
>
>
>
> It comes back with
>
>
> +---------------------+---------------------+---------------------+---------------------+
> |      mystddev       |       stddev        |     stddev_samp     |     stddev_pop      |
> +---------------------+---------------------+---------------------+---------------------+
> | 458.59339942758265  | *458.5931211084447* | 458.5933504051772   | *458.5931211084447* |
> +---------------------+---------------------+---------------------+---------------------+
>
> *spark-sql* comes back with:
>
>
> spark-sql> SELECT
>          >     SQRT((SUM(POWER(amount_sold,2)) - (COUNT(1)*POWER(AVG(amount_sold),2))) / (COUNT(1)-1)) AS mystddev,
>          >     STDDEV(amount_sold)      AS stddev,
>          >     STDDEV_SAMP(amount_sold) AS stddev_samp,
>          >     STDDEV_POP(amount_sold)  AS stddev_pop
>          > FROM sales;
>
>
>
> +---------------------+---------------------+---------------------+---------------------+
> |      mystddev       |       stddev        |     stddev_samp     |     stddev_pop      |
> +---------------------+---------------------+---------------------+---------------------+
> | 458.59339942758265  | *458.5933504051778* | *458.5933504051778* | 458.59312110844525  |
> +---------------------+---------------------+---------------------+---------------------+
>
> Just wanted to see what numpy would come back with
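>
> Something like this would show it (a sketch; it pulls the column into a
> local list, which assumes it fits in the driver's memory, with the table
> name as in the query above):
>
> import numpy as np
> vals = [r.amount_sold for r in spark.sql("SELECT amount_sold FROM sales").collect()]
> print(np.std(vals))          # ddof=0 (default): should match Hive's STDDEV / STDDEV_POP
> print(np.std(vals, ddof=1))  # should match STDDEV_SAMP and Spark's STDDEV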
>
> Thanks
>
> On Wed, 23 Dec 2020 at 23:50, Sean Owen <sro...@gmail.com> wrote:
>
>> Why do you want to use this function instead of the built-in stddev
>> function?
>>
>> On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> This is a shot in the dark so to speak.
>>>
>>>
>>> I would like to use the standard deviation function std offered by numpy
>>> in PySpark. I am using Spark SQL for now.
>>>
>>>
>>> The code is as below:
>>>
>>>
>>>   sqltext = f"""
>>>   SELECT
>>>           rs.Customer_ID
>>>         , rs.Number_of_orders
>>>         , rs.Total_customer_amount
>>>         , rs.Average_order
>>>         , rs.Standard_deviation
>>>   FROM
>>>   (
>>>         SELECT cust_id AS Customer_ID,
>>>                COUNT(amount_sold) AS Number_of_orders,
>>>                SUM(amount_sold) AS Total_customer_amount,
>>>                AVG(amount_sold) AS Average_order,
>>>                STDDEV(amount_sold) AS Standard_deviation
>>>         FROM {DB}.{table}
>>>         GROUP BY cust_id
>>>         HAVING SUM(amount_sold) > 94000
>>>         AND    AVG(amount_sold) < STDDEV(amount_sold)
>>>   ) rs
>>>   ORDER BY 3 DESC
>>>   """
>>>   spark.sql(sqltext)
>>>
>>> Now if I wanted to use a UDF based on the numpy std function, I can do
>>>
>>> import numpy as np
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import DoubleType
>>>
>>> # wrap np.std so it returns a plain Python float (Spark cannot map a
>>> # numpy.float64 to DoubleType), and give it each group's values as a list
>>> np_std = udf(lambda xs: float(np.std(xs)), DoubleType())
>>>
>>> How can I use that UDF with Spark SQL? I gather this is only possible
>>> through the functional (DataFrame) API?
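>>>
>>> (One route, as a sketch building on the np_std wrapper above, with the
>>> table name illustrative: register the function for SQL and feed it
>>> collect_list, since the UDF needs each group's values in one call.)
>>>
>>> spark.udf.register("np_std", lambda xs: float(np.std(xs)), DoubleType())
>>> spark.sql("SELECT cust_id, np_std(collect_list(amount_sold)) AS sd FROM sales GROUP BY cust_id")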
>>>
>>> Thanks,
>>>
>>> Mich
>>
