This has turned into a big thread for a simple thing and has been answered
3 times over now.
Neither is better; they just calculate different things. That the 'default'
is the sample stddev is just convention.
stddev_pop is the plain standard deviation of a fixed set of numbers.
stddev_samp is used when the numbers are a sample drawn from a larger
population and you want to estimate that population's standard deviation;
it applies Bessel's correction, dividing by n - 1 instead of n.
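For a concrete illustration outside Spark, plain Python's statistics module
exposes the same pair of calculations. A minimal sketch:

import statistics

xs = [52.7, 45.3, 60.2, 53.8, 49.1]

# population standard deviation: divide by n
print(statistics.pstdev(xs))

# sample standard deviation: divide by n - 1 (Bessel's correction)
print(statistics.stdev(xs))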
Spark uses the sample standard deviation (stddev_samp) by default, whereas
*Hive* uses the population standard deviation (stddev_pop) as its default.
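If you want to check what stddev resolves to in your own session, DESCRIBE
FUNCTION reports the underlying implementation class. A quick sketch,
assuming an active SparkSession named spark (the exact output format varies
by Spark version):

spark.sql("DESCRIBE FUNCTION stddev").show(truncate=False)
# Should report the StddevSamp expression class,
# i.e. stddev is an alias for stddev_samp.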
My understanding is that Spark uses the sample standard deviation by default
because
- it is more commonly used,
- it is no more expensive to calculate (both versions come from the same
  running sums; only the denominator differs), and
- it matches the STDDEV default of most other SQL databases.
Hi Helen,
Assuming you want to calculate stddev_samp, Spark correctly points STDDEV
to STDDEV_SAMP.
In the query below, replace sales with your table name and AMOUNT_SOLD with
the column you want to run the calculation over:
SELECT
  SQRT((SUM(POWER(AMOUNT_SOLD, 2)) - (COUNT(1) * POWER(AVG(AMOUNT_SOLD), 2)))
       / (COUNT(1) - 1)) AS STDDEV_SAMP
FROM sales;
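(That expression is just the expanded form of the sample variance:
SUM((x - mean)^2) = SUM(x^2) - n * mean^2, so dividing by n - 1 and taking
the square root reproduces STDDEV_SAMP.)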
from pyspark.sql import SparkSession
from pyspark.sql.functions import stddev, stddev_samp, stddev_pop

spark = SparkSession.builder.getOrCreate()

data = [(52.7,), (45.3,), (60.2,), (53.8,), (49.1,), (44.6,), (58.0,),
        (56.5,), (47.9,), (50.3,)]
df = spark.createDataFrame(data, ["value"])

# stddev is an alias for stddev_samp (n - 1 denominator);
# stddev_pop divides by n instead.
df.select(stddev("value"), stddev_samp("value"), stddev_pop("value")).show()
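With this data, stddev and stddev_samp should agree (about 5.32), while
stddev_pop comes out slightly smaller (about 5.05), because it divides by
10 rather than 9.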
PySpark follows SQL databases here. stddev is stddev_samp, and the sample
standard deviation is the calculation with Bessel's correction, n - 1 in
the denominator. stddev_pop is the plain standard deviation, with n in the
denominator.
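Written out, the two formulas differ only in that denominator:

  stddev_samp: s     = sqrt( sum((x - mean)^2) / (n - 1) )
  stddev_pop:  sigma = sqrt( sum((x - mean)^2) / n )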
On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe
wrote:
> Hi!
>
> I am