Re: Using UDF based on Numpy functions in Spark SQL

Mich Talebzadeh Sat, 26 Dec 2020 06:58:26 -0800

Well I gave up on using anything except the standard one offered by PySpark
itself. The problem is that anything that is homemade (UDF), is never going
to be as performant as the functions offered by Spark itself. What I don't
understand is why a numpy STDDEV provided should be more performant than
the native one? Are they claiming the numpy packages are more optimised?


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





On Thu, 24 Dec 2020 at 14:13, Sean Owen <sro...@gmail.com> wrote:

> Why not just use STDDEV_SAMP? it's probably more accurate than the
> differences-of-squares calculation.
> You can write an aggregate UDF that calls numpy and register it for SQL,
> but, it is already a built-in.
>
> On Thu, Dec 24, 2020 at 8:12 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Thanks for the feedback.
>>
>> I have a question here. I want to use numpy STD as well but just using
>> sql in pyspark. Like below
>>
>>   sqltext = f"""
>>   SELECT
>>           rs.Customer_ID
>>         , rs.Number_of_orders
>>         , rs.Total_customer_amount
>>         , rs.Average_order
>>         , rs.Standard_deviation
>>         , rs.mystddev
>>   FROM
>>   (
>>            SELECT cust_id AS Customer_ID
>>         ,  COUNT(amount_sold) AS Number_of_orders
>>         ,  SUM(amount_sold) AS Total_customer_amount
>>         ,  AVG(amount_sold) AS Average_order
>>         ,  STDDEV(amount_sold) AS Standard_deviation
>>         ,
>> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>> AS mystddev
>>            FROM {DB}.{table}
>>            GROUP BY cust_id
>>            HAVING SUM(amount_sold) > 94000
>>            AND AVG(amount_sold) < STDDEV(amount_sold)
>>   ) rs
>>   ORDER BY
>>           3 DESC
>>   """
>>   spark.sql(sqltext)
>>
>> The green one (mystddev) is the way I work standard deviation myself. I
>> want to add numpy STD to this sql as well but I don't think it is possible
>> here. Can a UDF help? . Or it is only possible with FP?
>>
>> thanks
>>
>>>
>>>>>>
>>>>>

Re: Using UDF based on Numpy functions in Spark SQL

Reply via email to