Thanks for the feedback.
I have a question here. I want to use numpy STD as well but just using sql
in pyspark. Like below
sqltext = f"""
SELECT
rs.Customer_ID
, rs.Number_of_orders
, rs.Total_customer_amount
, rs.Average_order
, rs.Standard_deviation
, rs.mystddev
FROM
(
SELECT cust_id AS Customer_ID
, COUNT(amount_sold) AS Number_of_orders
, SUM(amount_sold) AS Total_customer_amount
, AVG(amount_sold) AS Average_order
, STDDEV(amount_sold) AS Standard_deviation
,
SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
AS mystddev
FROM {DB}.{table}
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
AND AVG(amount_sold) < STDDEV(amount_sold)
) rs
ORDER BY
3 DESC
"""
spark.sql(sqltext)
The green one (mystddev) is the way I work standard deviation myself. I
want to add numpy STD to this sql as well but I don't think it is possible
here. Can a UDF help? . Or it is only possible with FP?
thanks
On Thu, 24 Dec 2020 at 13:47, Sean Owen <[email protected]> wrote:
> I don't know which one is 'correct' (it's not standard SQL?) or whether
> it's the sample stdev for a good reason or just historical now. But you can
> always call STDDEV_SAMP (in any DB) if needed. It's equivalent to numpy.std
> with ddof=1, the Bessel-corrected standard deviation.
>
> On Thu, Dec 24, 2020 at 3:17 AM Mich Talebzadeh <[email protected]>
> wrote:
>
>>
>> Well the truth is that we had this discussion in 2016 :(. what Hive
>> calls Standard Deviation Function STDDEV is a pointer to STDDEV_POP. This
>> is incorrect and has not been rectified yet!
>>
>>
>> Spark-sql, Oracle and Sybase point STDDEV to STDDEV_SAMP and not
>> STDDEV_POP. Run a test on *Hive*
>>
>>
>> SELECT
>>
>> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>> AS MYSTDDEV,
>> STDDEV(amount_sold) AS STDDEV,
>> STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>> STDDEV_POP(amount_sold) AS STDDEV_POP
>> from sales;
>>
>>
>>
>> It comes back with
>>
>>
>> +---------------------+--------------------+--------------------+--------------------+
>> | mystddev | stddev | stddev_samp |
>> stddev_pop |
>>
>> +---------------------+--------------------+--------------------+--------------------+
>> | 458.59339942758265 | *458.5931211084447* | 458.5933504051772 |
>> *458.5931211084447* |
>>
>> +---------------------+--------------------+--------------------+--------------------+
>>
>> *spark-sql *comes back with
>>
>>
>> spark-sql> SELECT
>>
>> >
>>
>> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>> AS MYSTDDEV,
>>
>> > STDDEV(amount_sold) AS STDDEV,
>>
>> > STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>>
>> > STDDEV_POP(amount_sold) AS STDDEV_POP
>>
>> > from sales;
>>
>>
>>
>> +---------------------+--------------------+--------------------+--------------------+
>> | mystddev | stddev | stddev_samp |
>> stddev_pop |
>>
>> +---------------------+--------------------+--------------------+--------------------+
>>
>> 458.59339942758265 *458.5933504051778* *458.5933504051778*
>> 458.59312110844525
>>
>> Just wanted to see what numpy would come back with
>>
>> Thanks
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Dec 2020 at 23:50, Sean Owen <[email protected]> wrote:
>>
>>> Why do you want to use this function instead of the built-in stddev
>>> function?
>>>
>>> On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> This is a shot in the dark so to speak.
>>>>
>>>>
>>>> I would like to use the standard deviation std offered by numpy in
>>>> PySpark. I am using SQL for now
>>>>
>>>>
>>>> The code as below
>>>>
>>>>
>>>> sqltext = f"""
>>>>
>>>> SELECT
>>>>
>>>> rs.Customer_ID
>>>>
>>>> , rs.Number_of_orders
>>>>
>>>> , rs.Total_customer_amount
>>>>
>>>> , rs.Average_order
>>>>
>>>> , rs.Standard_deviation
>>>>
>>>> FROM
>>>>
>>>> (
>>>>
>>>> SELECT cust_id AS Customer_ID,
>>>>
>>>> COUNT(amount_sold) AS Number_of_orders,
>>>>
>>>> SUM(amount_sold) AS Total_customer_amount,
>>>>
>>>> AVG(amount_sold) AS Average_order,
>>>>
>>>> * STDDEV(amount_sold) AS Standard_deviation*
>>>>
>>>> FROM {DB}.{table}
>>>>
>>>> GROUP BY cust_id
>>>>
>>>> HAVING SUM(amount_sold) > 94000
>>>>
>>>> AND AVG(amount_sold) < STDDEV(amount_sold)
>>>>
>>>> ) rs
>>>>
>>>> ORDER BY
>>>>
>>>> 3 DESC
>>>>
>>>> """
>>>>
>>>> spark.sql(sqltext)
>>>>
>>>> Now if I wanted to use UDF based on numpy STD function, I can do
>>>>
>>>> import numpy as np
>>>> from pyspark.sql.functions import UserDefinedFunction
>>>> from pyspark.sql.types import DoubleType
>>>> udf = UserDefinedFunction(np.std, DoubleType())
>>>>
>>>> How can I use that udf with spark SQL? I gather this is only possible
>>>> through functional programming?
>>>>
>>>> Thanks,
>>>>
>>>> Mich
>>>>
>>>>
>>>>
>>>>
>>>> LinkedIn *
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>