Hi,
This is a shot in the dark so to speak.
I would like to use the standard deviation function std offered by numpy in PySpark.
I am using SQL for now.
The code is as below:
sqltext = f"""
SELECT
rs.Customer_ID
, rs.Number_of_orders
, rs.Total_customer_amount
, rs.Average_order
, rs.Standard_deviation
FROM
(
SELECT cust_id AS Customer_ID,
COUNT(amount_sold) AS Number_of_orders,
SUM(amount_sold) AS Total_customer_amount,
AVG(amount_sold) AS Average_order,
STDDEV(amount_sold) AS Standard_deviation
FROM {DB}.{table}
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
AND AVG(amount_sold) < STDDEV(amount_sold)
) rs
ORDER BY
3 DESC
"""
spark.sql(sqltext)
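One caveat I am aware of when comparing the two: STDDEV in Spark SQL is the sample standard deviation (it is an alias of STDDEV_SAMP), whereas numpy's std defaults to the population form (ddof=0), so the two will not return identical numbers unless ddof=1 is passed to numpy. A small sketch of the difference, using only Python's statistics module and made-up amounts (the values are purely illustrative, not from my table):

```python
import math
import statistics

# Made-up order amounts for a single customer (illustrative only).
amounts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population standard deviation: divides by n.
# This matches numpy's default, np.std(amounts) with ddof=0.
population_sd = statistics.pstdev(amounts)

# Sample standard deviation: divides by n - 1.
# This matches STDDEV / STDDEV_SAMP in Spark SQL and np.std(amounts, ddof=1).
sample_sd = statistics.stdev(amounts)

print(f"population: {population_sd:.4f}, sample: {sample_sd:.4f}")
```

The population form corresponds to STDDEV_POP in Spark SQL, so that may be the simpler route if the numpy default is what is wanted.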
Now if I wanted to use a UDF based on the numpy std function, I can do:
import numpy as np
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType
udf = UserDefinedFunction(np.std, DoubleType())
How can I use that UDF with Spark SQL? I gather this is only possible
through functional programming?
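To make the question concrete, here is a minimal sketch of what I have in mind. It assumes that a Python function registered through spark.udf.register can be called from SQL, and that the grouped values can be handed to it as a list via COLLECT_LIST. The names np_std and register_np_std are my own placeholders, and I have not confirmed this is the recommended approach:

```python
import numpy as np


def np_std(values):
    """Population standard deviation (numpy default, ddof=0) of a list of numbers."""
    # Convert to a plain Python float: returning numpy scalars from a UDF
    # is a common source of serialisation errors with DoubleType.
    return float(np.std(values))


def register_np_std(spark):
    """Register np_std under the (placeholder) SQL function name 'np_std'."""
    # Imported here so that np_std itself can be tried without Spark installed.
    from pyspark.sql.types import DoubleType

    spark.udf.register("np_std", np_std, DoubleType())


# Intended usage, assuming an existing SparkSession called spark and the same
# {DB}.{table} placeholders as in the query above:
#
#   register_np_std(spark)
#   spark.sql(f"""
#       SELECT cust_id,
#              np_std(COLLECT_LIST(CAST(amount_sold AS DOUBLE))) AS Standard_deviation
#       FROM {DB}.{table}
#       GROUP BY cust_id
#   """)
```

The cast to DOUBLE is there on the assumption that amount_sold may be a decimal column, which numpy would otherwise receive as Python Decimal objects.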
Thanks,
Mich
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.