OK, thanks for the tip.

I found this page from the Databricks documentation on Python UDFs useful:

User-defined functions - Python — Databricks Documentation
https://docs.databricks.com/spark/latest/spark-sql/udf-python.html
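
For what it is worth, a minimal sketch of registering a plain Python function so it can be called from Spark SQL, along the lines of that page and the Stack Overflow answer quoted below (the function, table and column names here are only illustrative):

from pyspark.sql.types import DoubleType

def squared(x):
    # an ordinary scalar function, applied one row at a time
    return float(x) * float(x)

# register it under a name that SQL can see
spark.udf.register("squared", squared, DoubleType())

spark.sql("SELECT cust_id, squared(amount_sold) FROM sales").show()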
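
Note, though, that an ordinary UDF is applied one row at a time, while np.std aggregates over many values, so for a per-group standard deviation the numpy route seems to need a grouped-aggregate pandas UDF instead. A rough sketch, assuming Spark 2.4+ with PyArrow available (df and the column names are illustrative):

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def np_std(v: pd.Series) -> float:
    # np.std defaults to the population std (ddof=0); Spark's STDDEV
    # is the sample std, so ddof=1 matches the quoted query below
    return float(np.std(v, ddof=1))

# df is assumed to be a DataFrame over the same sales table
df.groupBy("cust_id").agg(np_std(df["amount_sold"]).alias("Standard_deviation"))

I am not certain a grouped-aggregate pandas UDF can be registered for use from SQL text, which is why this sketch goes through the DataFrame API.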



On Wed, 23 Dec 2020 at 21:31, Peyman Mohajerian <mohaj...@gmail.com> wrote:

>
> https://stackoverflow.com/questions/43484269/how-to-register-udf-to-use-in-sql-and-dataframe
>
> On Wed, Dec 23, 2020 at 12:52 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> This is a shot in the dark, so to speak.
>>
>> I would like to use the standard deviation function (np.std) offered by
>> numpy in PySpark. I am using Spark SQL for now.
>>
>> The code is as below:
>>
>>   sqltext = f"""
>>   SELECT
>>           rs.Customer_ID
>>         , rs.Number_of_orders
>>         , rs.Total_customer_amount
>>         , rs.Average_order
>>         , rs.Standard_deviation
>>   FROM
>>   (
>>         SELECT cust_id AS Customer_ID,
>>         COUNT(amount_sold) AS Number_of_orders,
>>         SUM(amount_sold) AS Total_customer_amount,
>>         AVG(amount_sold) AS Average_order,
>>         STDDEV(amount_sold) AS Standard_deviation
>>         FROM {DB}.{table}
>>         GROUP BY cust_id
>>         HAVING SUM(amount_sold) > 94000
>>         AND AVG(amount_sold) < STDDEV(amount_sold)
>>   ) rs
>>   ORDER BY 3 DESC
>>   """
>>   spark.sql(sqltext)
>>
>> Now, if I wanted to use a UDF based on the numpy std function, I could do:
>>
>> import numpy as np
>> from pyspark.sql.functions import UserDefinedFunction
>> from pyspark.sql.types import DoubleType
>>
>> # wrap numpy's std as a Spark UDF that returns a double
>> std_udf = UserDefinedFunction(np.std, DoubleType())
>>
>> How can I use that UDF with Spark SQL? Or is this only possible through
>> the functional (DataFrame) API?
>>
>> Thanks,
>>
>> Mich
>>
>
