I don't know which one is 'correct' (it isn't defined by the SQL standard?), or whether STDDEV maps to the sample stdev for a good reason or just a historical one. But you can always call STDDEV_SAMP (in any DB) if you need it explicitly. It's equivalent to numpy.std with ddof=1, i.e. the Bessel-corrected standard deviation.
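For example, a quick check with numpy (the values below are made up purely for illustration):

import numpy as np

amounts = [10.0, 12.0, 23.0, 23.0, 16.0, 23.0, 21.0, 16.0]  # illustrative data only
print(np.std(amounts))           # ddof=0: population stdev, what STDDEV_POP returns
print(np.std(amounts, ddof=1))   # ddof=1: Bessel-corrected sample stdev, what STDDEV_SAMP returns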
On Thu, Dec 24, 2020 at 3:17 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Well the truth is that we had this discussion in 2016 :(. What Hive calls
> the Standard Deviation Function STDDEV is a pointer to STDDEV_POP. This is
> incorrect and has not been rectified yet!
>
> Spark-sql, Oracle and Sybase point STDDEV to STDDEV_SAMP and not
> STDDEV_POP. Run a test on *Hive*
>
> SELECT
>   SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS MYSTDDEV,
>   STDDEV(amount_sold) AS STDDEV,
>   STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>   STDDEV_POP(amount_sold) AS STDDEV_POP
> from sales;
>
> It comes back with
>
> +----------------------+----------------------+----------------------+----------------------+
> | mystddev             | stddev               | stddev_samp          | stddev_pop           |
> +----------------------+----------------------+----------------------+----------------------+
> | 458.59339942758265   | *458.5931211084447*  | 458.5933504051772    | *458.5931211084447*  |
> +----------------------+----------------------+----------------------+----------------------+
>
> *spark-sql* comes back with
>
> spark-sql> SELECT
>   SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS MYSTDDEV,
>   STDDEV(amount_sold) AS STDDEV,
>   STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>   STDDEV_POP(amount_sold) AS STDDEV_POP
>   from sales;
>
> +----------------------+----------------------+----------------------+----------------------+
> | mystddev             | stddev               | stddev_samp          | stddev_pop           |
> +----------------------+----------------------+----------------------+----------------------+
> | 458.59339942758265   | *458.5933504051778*  | *458.5933504051778*  | 458.59312110844525   |
> +----------------------+----------------------+----------------------+----------------------+
>
> Just wanted to see what numpy would come back with
>
> Thanks
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Wed, 23 Dec 2020 at 23:50, Sean Owen <sro...@gmail.com> wrote:
>
>> Why do you want to use this function instead of the built-in stddev
>> function?
>>
>> On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> This is a shot in the dark so to speak.
>>>
>>> I would like to use the standard deviation std offered by numpy in
>>> PySpark.
>>> I am using SQL for now.
>>>
>>> The code is as below
>>>
>>> sqltext = f"""
>>> SELECT
>>>       rs.Customer_ID
>>>     , rs.Number_of_orders
>>>     , rs.Total_customer_amount
>>>     , rs.Average_order
>>>     , rs.Standard_deviation
>>> FROM
>>> (
>>>     SELECT cust_id AS Customer_ID,
>>>            COUNT(amount_sold) AS Number_of_orders,
>>>            SUM(amount_sold) AS Total_customer_amount,
>>>            AVG(amount_sold) AS Average_order,
>>>            STDDEV(amount_sold) AS Standard_deviation
>>>     FROM {DB}.{table}
>>>     GROUP BY cust_id
>>>     HAVING SUM(amount_sold) > 94000
>>>     AND AVG(amount_sold) < STDDEV(amount_sold)
>>> ) rs
>>> ORDER BY 3 DESC
>>> """
>>> spark.sql(sqltext)
>>>
>>> Now if I wanted to use a UDF based on the numpy std function, I can do
>>>
>>> import numpy as np
>>> from pyspark.sql.functions import UserDefinedFunction
>>> from pyspark.sql.types import DoubleType
>>> udf = UserDefinedFunction(np.std, DoubleType())
>>>
>>> How can I use that udf with Spark SQL? I gather this is only possible
>>> through functional programming?
>>>
>>> Thanks,
>>>
>>> Mich
>>>
>>> LinkedIn *https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
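One way to get numpy's std from Spark SQL, rather than the built-in aggregates, is to register a Python UDF and feed it the output of collect_list(). This is only a minimal sketch: the name np_std is made up, it reuses the sales table from the test above, it assumes an existing SparkSession named spark, and collecting every group's values into an array will not scale to very large groups.

import numpy as np
from pyspark.sql.types import DoubleType

# Register a plain Python UDF that applies numpy's std to an array of values.
# numpy defaults to ddof=0 (population stdev); pass ddof=1 for the Bessel-corrected sample stdev.
spark.udf.register("np_std", lambda xs: float(np.std(xs, ddof=1)), DoubleType())

# A UDF is evaluated per row, not per group, so aggregate each group into an array first:
spark.sql("""
    SELECT cust_id,
           np_std(collect_list(amount_sold)) AS Standard_deviation
    FROM sales
    GROUP BY cust_id
""").show()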