Thanks for the feedback. I have a follow-up question: I would like to use numpy's std as well, but through SQL in PySpark, like below.
sqltext = f"""
SELECT
    rs.Customer_ID
  , rs.Number_of_orders
  , rs.Total_customer_amount
  , rs.Average_order
  , rs.Standard_deviation
  , rs.mystddev
FROM
(
    SELECT
        cust_id AS Customer_ID
      , COUNT(amount_sold) AS Number_of_orders
      , SUM(amount_sold) AS Total_customer_amount
      , AVG(amount_sold) AS Average_order
      , STDDEV(amount_sold) AS Standard_deviation
      , SQRT((SUM(POWER(AMOUNT_SOLD,2)) - (COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS mystddev
    FROM {DB}.{table}
    GROUP BY cust_id
    HAVING SUM(amount_sold) > 94000
       AND AVG(amount_sold) < STDDEV(amount_sold)
) rs
ORDER BY 3 DESC
"""
spark.sql(sqltext)

The mystddev column (the expression shown in green in the original mail) is where I work out the standard deviation myself. I would like to add numpy's std to this SQL as well, but I don't think it is possible as the query stands. Can a UDF help here, or is it only possible through the DataFrame (functional programming) API?

Thanks
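One route I had in mind, though I have not verified it end to end, is to register a plain Python function that wraps np.std and feed it COLLECT_LIST(amount_sold) per group, so everything stays callable from spark.sql(). A rough sketch only; np_std and numpy_stddev are names I have made up, it assumes amount_sold is a DOUBLE (cast it first otherwise), and collecting every amount per customer into a list could get expensive for very large groups:

import numpy as np
from pyspark.sql.types import DoubleType

def np_std(values):
    # values arrives as a Python list built by COLLECT_LIST. np.std defaults to the
    # population figure (ddof=0); pass ddof=1 to match Spark's STDDEV / STDDEV_SAMP.
    return float(np.std(values)) if values else None

# Register the function so it is visible by name inside spark.sql()
spark.udf.register("np_std", np_std, DoubleType())

sqltext = f"""
SELECT cust_id                           AS Customer_ID
     , COUNT(amount_sold)                AS Number_of_orders
     , SUM(amount_sold)                  AS Total_customer_amount
     , AVG(amount_sold)                  AS Average_order
     , STDDEV(amount_sold)               AS Standard_deviation
     , np_std(COLLECT_LIST(amount_sold)) AS numpy_stddev
FROM {DB}.{table}
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
   AND AVG(amount_sold) < STDDEV(amount_sold)
ORDER BY 3 DESC
"""
spark.sql(sqltext)

Does that look like a sensible way to go about it?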
On Thu, 24 Dec 2020 at 13:47, Sean Owen <sro...@gmail.com> wrote:

> I don't know which one is 'correct' (it's not standard SQL?) or whether
> it's the sample stdev for a good reason or just historical now. But you
> can always call STDDEV_SAMP (in any DB) if needed. It's equivalent to
> numpy.std with ddof=1, the Bessel-corrected standard deviation.
>
> On Thu, Dec 24, 2020 at 3:17 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Well, the truth is that we had this discussion back in 2016 :(. What Hive
>> calls the standard deviation function STDDEV is a pointer to STDDEV_POP.
>> This is incorrect and has not been rectified yet!
>>
>> Spark SQL, Oracle and Sybase point STDDEV to STDDEV_SAMP and not
>> STDDEV_POP. Run a test on Hive:
>>
>> SELECT
>>     SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS MYSTDDEV,
>>     STDDEV(amount_sold)      AS STDDEV,
>>     STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>>     STDDEV_POP(amount_sold)  AS STDDEV_POP
>> FROM sales;
>>
>> It comes back with
>>
>> +---------------------+--------------------+--------------------+--------------------+
>> |      mystddev       |       stddev       |    stddev_samp     |     stddev_pop     |
>> +---------------------+--------------------+--------------------+--------------------+
>> | 458.59339942758265  | 458.5931211084447  | 458.5933504051772  | 458.5931211084447  |
>> +---------------------+--------------------+--------------------+--------------------+
>>
>> so Hive's STDDEV agrees with STDDEV_POP. Running the same query in
>> spark-sql comes back with
>>
>> +---------------------+--------------------+--------------------+--------------------+
>> |      mystddev       |       stddev       |    stddev_samp     |     stddev_pop     |
>> +---------------------+--------------------+--------------------+--------------------+
>> | 458.59339942758265  | 458.5933504051778  | 458.5933504051778  | 458.59312110844525 |
>> +---------------------+--------------------+--------------------+--------------------+
>>
>> where STDDEV agrees with STDDEV_SAMP instead. Just wanted to see what
>> numpy would come back with.
>>
>> Thanks
>>
>> On Wed, 23 Dec 2020 at 23:50, Sean Owen <sro...@gmail.com> wrote:
>>
>>> Why do you want to use this function instead of the built-in stddev
>>> function?
>>>
>>> On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is a shot in the dark, so to speak.
>>>>
>>>> I would like to use the standard deviation (std) offered by numpy in
>>>> PySpark. I am using SQL for now.
>>>>
>>>> The code is as below:
>>>>
>>>> sqltext = f"""
>>>> SELECT
>>>>     rs.Customer_ID
>>>>   , rs.Number_of_orders
>>>>   , rs.Total_customer_amount
>>>>   , rs.Average_order
>>>>   , rs.Standard_deviation
>>>> FROM
>>>> (
>>>>     SELECT cust_id AS Customer_ID,
>>>>            COUNT(amount_sold) AS Number_of_orders,
>>>>            SUM(amount_sold) AS Total_customer_amount,
>>>>            AVG(amount_sold) AS Average_order,
>>>>            STDDEV(amount_sold) AS Standard_deviation
>>>>     FROM {DB}.{table}
>>>>     GROUP BY cust_id
>>>>     HAVING SUM(amount_sold) > 94000
>>>>        AND AVG(amount_sold) < STDDEV(amount_sold)
>>>> ) rs
>>>> ORDER BY 3 DESC
>>>> """
>>>> spark.sql(sqltext)
>>>>
>>>> Now if I wanted to use a UDF based on the numpy std function, I could do
>>>>
>>>> import numpy as np
>>>> from pyspark.sql.functions import UserDefinedFunction
>>>> from pyspark.sql.types import DoubleType
>>>> udf = UserDefinedFunction(np.std, DoubleType())
>>>>
>>>> How can I use that UDF with Spark SQL? I gather this is only possible
>>>> through functional programming?
>>>>
>>>> Thanks,
>>>>
>>>> Mich
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
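P.S. For anyone who wants to convince themselves of the ddof point Sean made further up the thread, and that the manual SQRT expression really is the sample figure, here is a quick local check with plain numpy. The numbers are toy values I made up, nothing to do with the sales table:

import numpy as np

# Toy values, purely illustrative
amounts = np.array([120.5, 94.0, 230.75, 180.25, 99.99])

n = amounts.size
mean = amounts.mean()

# Same algebra as the MYSTDDEV expression in the SQL:
# SQRT((SUM(POWER(x,2)) - COUNT(1)*POWER(AVG(x),2)) / (COUNT(1)-1))
mystddev = np.sqrt((np.sum(amounts ** 2) - n * mean ** 2) / (n - 1))

print(np.std(amounts))           # ddof=0: population std, what Hive's STDDEV (alias of STDDEV_POP) returns
print(np.std(amounts, ddof=1))   # ddof=1: sample std, what Spark SQL's STDDEV (alias of STDDEV_SAMP) returns
print(mystddev)                  # agrees with np.std(amounts, ddof=1) up to floating-point error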