Why not just use STDDEV_SAMP? It is likely more accurate than the
difference-of-squares calculation, which can lose precision to
catastrophic cancellation when the mean is large relative to the
variance. You can write an aggregate UDF that calls numpy and register
it for SQL, but the sample standard deviation is already a built-in.
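If you do want numpy's std callable from SQL anyway, here is a minimal
sketch of that aggregate-UDF route (assuming Spark 3.x with PyArrow
installed; the names numpy_std and sales are illustrative, not from the
original query):

    import numpy as np
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()

    # Grouped-aggregate pandas UDF: receives one pandas Series per group
    # and returns a single scalar. ddof=1 gives the sample standard
    # deviation (matching STDDEV_SAMP); numpy's default ddof=0 would be
    # the population form (STDDEV_POP).
    @pandas_udf("double")
    def numpy_std(v: pd.Series) -> float:
        return float(np.std(v, ddof=1))

    # Register it under a SQL-visible name so spark.sql(...) can call it.
    spark.udf.register("numpy_std", numpy_std)

    # Hypothetical usage against a table shaped like the one below:
    # spark.sql("SELECT cust_id, numpy_std(amount_sold) AS np_std "
    #           "FROM sales GROUP BY cust_id").show()

With ddof=1 the result should agree with STDDEV/STDDEV_SAMP up to
floating-point noise, which makes it a handy cross-check.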

On Thu, Dec 24, 2020 at 8:12 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks for the feedback.
>
> I have a question here. I want to use numpy's std as well, but using
> only SQL in PySpark, like below:
>
>   sqltext = f"""
>   SELECT
>           rs.Customer_ID
>         , rs.Number_of_orders
>         , rs.Total_customer_amount
>         , rs.Average_order
>         , rs.Standard_deviation
>         , rs.mystddev
>   FROM
>   (
>            SELECT cust_id AS Customer_ID
>         ,  COUNT(amount_sold) AS Number_of_orders
>         ,  SUM(amount_sold) AS Total_customer_amount
>         ,  AVG(amount_sold) AS Average_order
>         ,  STDDEV(amount_sold) AS Standard_deviation
>         ,  SQRT(
>              (SUM(POWER(amount_sold, 2)) - COUNT(1) * POWER(AVG(amount_sold), 2))
>              / (COUNT(1) - 1)
>            ) AS mystddev
>            FROM {DB}.{table}
>            GROUP BY cust_id
>            HAVING SUM(amount_sold) > 94000
>            AND AVG(amount_sold) < STDDEV(amount_sold)
>   ) rs
>   ORDER BY
>           3 DESC
>   """
>   spark.sql(sqltext)
>
> The mystddev column (shown in green in the original message) is how I
> work out the standard deviation myself. I would like to add numpy's std
> to this SQL as well, but I don't think it is possible here. Can a UDF
> help? Or is it only possible with FP (the DataFrame API)?
>
> thanks
>
