This has turned into a big thread for a simple thing and has been answered 3 times over now.
Neither is better, they just calculate different things. That the 'default' is sample stddev is just convention. stddev_pop is the simple standard deviation of a set of numbers stddev_samp is used when the set of numbers is a sample from a notional larger population, and you estimate the stddev of the population from the sample. They only differ in the denominator. Neither is more efficient at all or more/less sensitive to outliers. On Wed, Sep 20, 2023 at 3:06 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Spark uses the sample standard deviation stddev_samp by default, whereas > *Hive* uses population standard deviation stddev_pop as default. > > My understanding is that spark uses sample standard deviation by default > because > > - It is more commonly used. > - It is more efficient to calculate. > - It is less sensitive to outliers. (data points that differ > significantly from other observations in a dataset. They can be caused by a > variety of factors, such as measurement errors or edge events.) > > The sample standard deviation is less sensitive to outliers because it > divides by N-1 instead of N. This means that a single outlier will have a > smaller impact on the sample standard deviation than it would on the > population standard deviation. > > HTH > > Mich Talebzadeh, > Distinguished Technologist, Solutions Architect & Engineer > London > United Kingdom > > > view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > https://en.everybodywiki.com/Mich_Talebzadeh > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Tue, 19 Sept 2023 at 21:50, Sean Owen <sro...@gmail.com> wrote: > >> Pyspark follows SQL databases here. stddev is stddev_samp, and sample >> standard deviation is the calculation with the Bessel correction, n-1 in >> the denominator. stddev_pop is simply standard deviation, with n in the >> denominator. >> >> On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe <helene.b...@hydro.com.invalid> >> wrote: >> >>> Hi! >>> >>> >>> >>> I am applying the stddev function (so actually stddev_samp), however >>> when comparing with the sample standard deviation in Excel the resuls do >>> not match. >>> >>> I cannot find in your documentation any more specifics on how the sample >>> standard deviation is calculated, so I cannot compare the difference toward >>> excel, which uses >>> >>> . >>> >>> I am trying to avoid using Excel at all costs, but if the stddev_samp >>> function is not calculating the standard deviation correctly I have a >>> problem. >>> >>> I hope you can help me resolve this issue. >>> >>> >>> >>> Kindest regards, >>> >>> >>> >>> *Helene Bøe* >>> *Graduate Project Engineer* >>> Recycling Process & Support >>> >>> M: +47 980 00 887 >>> helene.b...@hydro.com >>> <https://intra.hydro.com/EPiServer/CMS/Content/en/%2c%2c9/?epieditmode=False> >>> >>> Norsk Hydro ASA >>> Drammensveien 264 >>> NO-0283 Oslo, Norway >>> www.hydro.com >>> <https://intra.hydro.com/EPiServer/CMS/Content/en/%2c%2c9/?epieditmode=False> >>> >>> >>> NOTICE: This e-mail transmission, and any documents, files or previous >>> e-mail messages attached to it, may contain confidential or privileged >>> information. If you are not the intended recipient, or a person responsible >>> for delivering it to the intended recipient, you are hereby notified that >>> any disclosure, copying, distribution or use of any of the information >>> contained in or attached to this message is STRICTLY PROHIBITED. If you >>> have received this transmission in error, please immediately notify the >>> sender and delete the e-mail and attached documents. Thank you. >>> >>