This has turned into a big thread for a simple thing and has been answered
3 times over now.

Neither is better, they just calculate different things. That the 'default'
is sample stddev is just convention.
stddev_pop is the simple standard deviation of a set of numbers
stddev_samp is used when the set of numbers is a sample from a notional
larger population, and you estimate the stddev of the population from the
sample.

They only differ in the denominator. Neither is more efficient at all or
more/less sensitive to outliers.

On Wed, Sep 20, 2023 at 3:06 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Spark uses the sample standard deviation stddev_samp by default, whereas
> *Hive* uses population standard deviation stddev_pop as default.
>
> My understanding is that spark uses sample standard deviation by default
> because
>
>    - It is more commonly used.
>    - It is more efficient to calculate.
>    - It is less sensitive to outliers. (data points that differ
>    significantly from other observations in a dataset. They can be caused by a
>    variety of factors, such as measurement errors or edge events.)
>
> The sample standard deviation is less sensitive to outliers because it
> divides by N-1 instead of N. This means that a single outlier will have a
> smaller impact on the sample standard deviation than it would on the
> population standard deviation.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 19 Sept 2023 at 21:50, Sean Owen <sro...@gmail.com> wrote:
>
>> Pyspark follows SQL databases here. stddev is stddev_samp, and sample
>> standard deviation is the calculation with the Bessel correction, n-1 in
>> the denominator. stddev_pop is simply standard deviation, with n in the
>> denominator.
>>
>> On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe <helene.b...@hydro.com.invalid>
>> wrote:
>>
>>> Hi!
>>>
>>>
>>>
>>> I am applying the stddev function (so actually stddev_samp), however
>>> when comparing with the sample standard deviation in Excel the resuls do
>>> not match.
>>>
>>> I cannot find in your documentation any more specifics on how the sample
>>> standard deviation is calculated, so I cannot compare the difference toward
>>> excel, which uses
>>>
>>> .
>>>
>>> I am trying to avoid using Excel at all costs, but if the stddev_samp
>>> function is not calculating the standard deviation correctly I have a
>>> problem.
>>>
>>> I hope you can help me resolve this issue.
>>>
>>>
>>>
>>> Kindest regards,
>>>
>>>
>>>
>>> *Helene Bøe*
>>> *Graduate Project Engineer*
>>> Recycling Process & Support
>>>
>>> M: +47 980 00 887
>>> helene.b...@hydro.com
>>> <https://intra.hydro.com/EPiServer/CMS/Content/en/%2c%2c9/?epieditmode=False>
>>>
>>> Norsk Hydro ASA
>>> Drammensveien 264
>>> NO-0283 Oslo, Norway
>>> www.hydro.com
>>> <https://intra.hydro.com/EPiServer/CMS/Content/en/%2c%2c9/?epieditmode=False>
>>>
>>>
>>> NOTICE: This e-mail transmission, and any documents, files or previous
>>> e-mail messages attached to it, may contain confidential or privileged
>>> information. If you are not the intended recipient, or a person responsible
>>> for delivering it to the intended recipient, you are hereby notified that
>>> any disclosure, copying, distribution or use of any of the information
>>> contained in or attached to this message is STRICTLY PROHIBITED. If you
>>> have received this transmission in error, please immediately notify the
>>> sender and delete the e-mail and attached documents. Thank you.
>>>
>>

Reply via email to