Why not just use STDDEV_SAMP? It's likely more numerically stable than the difference-of-squares calculation in your mystddev column. You can write an aggregate UDF that calls numpy and register it for SQL (sketch below), but the sample standard deviation is already a built-in.
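
If you do want the numpy version callable from SQL, a grouped-aggregate pandas UDF can be registered with spark.udf.register and then used in a GROUP BY query. A minimal sketch, assuming Spark 2.4+ with pyarrow installed and an active SparkSession named spark; numpy_std is just a name I've picked:

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Grouped-aggregate pandas UDF: receives all amount_sold values for a
# group as a pandas Series and returns a single double.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def numpy_std(v: pd.Series) -> float:
    # np.std defaults to ddof=0 (population stddev, i.e. STDDEV_POP);
    # ddof=1 gives the sample stddev, matching STDDEV / STDDEV_SAMP.
    return float(np.std(v, ddof=1))

# Register under a SQL-callable name so it can sit next to STDDEV()
spark.udf.register("numpy_std", numpy_std)

spark.sql(f"""
    SELECT cust_id
         , STDDEV(amount_sold)    AS builtin_stddev
         , numpy_std(amount_sold) AS numpy_stddev
    FROM {DB}.{table}
    GROUP BY cust_id
""").show()

The two columns should agree (up to floating-point noise) as long as the ddof values line up; numpy's default would instead reproduce STDDEV_POP.
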
On Thu, Dec 24, 2020 at 8:12 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Thanks for the feedback.
>
> I have a question here. I want to use numpy STD as well, but just using SQL
> in PySpark, like below:
>
> sqltext = f"""
> SELECT
>       rs.Customer_ID
>     , rs.Number_of_orders
>     , rs.Total_customer_amount
>     , rs.Average_order
>     , rs.Standard_deviation
>     , rs.mystddev
> FROM
> (
>     SELECT cust_id AS Customer_ID
>          , COUNT(amount_sold) AS Number_of_orders
>          , SUM(amount_sold) AS Total_customer_amount
>          , AVG(amount_sold) AS Average_order
>          , STDDEV(amount_sold) AS Standard_deviation
>          , SQRT((SUM(POWER(AMOUNT_SOLD,2)) - (COUNT(1)*POWER(AVG(AMOUNT_SOLD),2))) / (COUNT(1)-1)) AS mystddev
>     FROM {DB}.{table}
>     GROUP BY cust_id
>     HAVING SUM(amount_sold) > 94000
>     AND AVG(amount_sold) < STDDEV(amount_sold)
> ) rs
> ORDER BY 3 DESC
> """
> spark.sql(sqltext)
>
> The green one (mystddev) is how I work out the standard deviation myself. I
> want to add numpy STD to this SQL as well, but I don't think it is possible
> here. Can a UDF help? Or is it only possible with FP?
>
> Thanks