Re: Using UDF based on Numpy functions in Spark SQL

Mich Talebzadeh Thu, 24 Dec 2020 06:12:50 -0800

Thanks for the feedback.

I have a question here. I want to use numpy STD as well but just using sql
in pyspark. Like below


  sqltext = f"""
  SELECT
          rs.Customer_ID
        , rs.Number_of_orders
        , rs.Total_customer_amount
        , rs.Average_order
        , rs.Standard_deviation
        , rs.mystddev
  FROM
  (
           SELECT cust_id AS Customer_ID
        ,  COUNT(amount_sold) AS Number_of_orders
        ,  SUM(amount_sold) AS Total_customer_amount
        ,  AVG(amount_sold) AS Average_order
        ,  STDDEV(amount_sold) AS Standard_deviation
        ,
SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
AS mystddev
           FROM {DB}.{table}
           GROUP BY cust_id
           HAVING SUM(amount_sold) > 94000
           AND AVG(amount_sold) < STDDEV(amount_sold)
  ) rs
  ORDER BY
          3 DESC
  """
  spark.sql(sqltext)

The green one (mystddev) is the way I work standard deviation myself. I
want to add numpy STD to this sql as well but I don't think it is possible
here. Can a UDF help? . Or it is only possible with FP?

thanks


On Thu, 24 Dec 2020 at 13:47, Sean Owen <sro...@gmail.com> wrote:

> I don't know which one is 'correct' (it's not standard SQL?) or whether
> it's the sample stdev for a good reason or just historical now. But you can
> always call STDDEV_SAMP (in any DB) if needed. It's equivalent to numpy.std
> with ddof=1, the Bessel-corrected standard deviation.
>
> On Thu, Dec 24, 2020 at 3:17 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>>
>> Well the truth is that we had this discussion in 2016 :(. what Hive
>> calls Standard Deviation Function STDDEV is a pointer to STDDEV_POP. This
>> is incorrect and has not been rectified yet!
>>
>>
>> Spark-sql, Oracle and Sybase point STDDEV to STDDEV_SAMP and not
>> STDDEV_POP.  Run a test on *Hive*
>>
>>
>> SELECT
>>
>> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>> AS MYSTDDEV,
>>         STDDEV(amount_sold) AS STDDEV,
>>         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>>         STDDEV_POP(amount_sold) AS STDDEV_POP
>> from    sales;
>>
>>
>>
>> It comes back with
>>
>>
>> +---------------------+--------------------+--------------------+--------------------+
>> |      mystddev       |       stddev       |    stddev_samp     |
>>  stddev_pop     |
>>
>> +---------------------+--------------------+--------------------+--------------------+
>> | 458.59339942758265  | *458.5931211084447*  | 458.5933504051772  |
>> *458.5931211084447*  |
>>
>> +---------------------+--------------------+--------------------+--------------------+
>>
>> *spark-sql *comes back with
>>
>>
>> spark-sql> SELECT
>>
>>          >
>>  
>> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>> AS MYSTDDEV,
>>
>>          >         STDDEV(amount_sold) AS STDDEV,
>>
>>          >         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>>
>>          >         STDDEV_POP(amount_sold) AS STDDEV_POP
>>
>>          > from    sales;
>>
>>
>>
>> +---------------------+--------------------+--------------------+--------------------+
>> |      mystddev       |       stddev       |    stddev_samp     |
>>  stddev_pop     |
>>
>> +---------------------+--------------------+--------------------+--------------------+
>>
>> 458.59339942758265      *458.5933504051778*       *458.5933504051778*
>>    458.59312110844525
>>
>> Just wanted to see what numpy would come back with
>>
>> Thanks
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Dec 2020 at 23:50, Sean Owen <sro...@gmail.com> wrote:
>>
>>> Why do you want to use this function instead of the built-in stddev
>>> function?
>>>
>>> On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> This is a shot in the dark so to speak.
>>>>
>>>>
>>>> I would like to use the standard deviation std offered by numpy in
>>>> PySpark. I am using SQL for now
>>>>
>>>>
>>>> The code as below
>>>>
>>>>
>>>>   sqltext = f"""
>>>>
>>>>   SELECT
>>>>
>>>>           rs.Customer_ID
>>>>
>>>>         , rs.Number_of_orders
>>>>
>>>>         , rs.Total_customer_amount
>>>>
>>>>         , rs.Average_order
>>>>
>>>>         , rs.Standard_deviation
>>>>
>>>>   FROM
>>>>
>>>>   (
>>>>
>>>>         SELECT cust_id AS Customer_ID,
>>>>
>>>>         COUNT(amount_sold) AS Number_of_orders,
>>>>
>>>>         SUM(amount_sold) AS Total_customer_amount,
>>>>
>>>>         AVG(amount_sold) AS Average_order,
>>>>
>>>>       *  STDDEV(amount_sold) AS Standard_deviation*
>>>>
>>>>         FROM {DB}.{table}
>>>>
>>>>         GROUP BY cust_id
>>>>
>>>>         HAVING SUM(amount_sold) > 94000
>>>>
>>>>         AND AVG(amount_sold) < STDDEV(amount_sold)
>>>>
>>>>   ) rs
>>>>
>>>>   ORDER BY
>>>>
>>>>           3 DESC
>>>>
>>>>   """
>>>>
>>>>   spark.sql(sqltext)
>>>>
>>>> Now if I wanted to use UDF based on numpy STD function, I can do
>>>>
>>>> import numpy as np
>>>> from pyspark.sql.functions import UserDefinedFunction
>>>> from pyspark.sql.types import DoubleType
>>>> udf = UserDefinedFunction(np.std, DoubleType())
>>>>
>>>> How can I use that udf with spark SQL? I gather this is only possible
>>>> through functional programming?
>>>>
>>>> Thanks,
>>>>
>>>> Mich
>>>>
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>

Re: Using UDF based on Numpy functions in Spark SQL

Reply via email to