On Wed, Mar 15, 2017 at 11:14 AM, Brendan Barnwell <[email protected]>
wrote:

> Exactly.  If you have one data point that occurs once, another that occurs
> twice, another that occurs three times, and so on up to 10, then the "least
> common" one (or two or three) isn't an outlier.  To be an outlier, it would
> have to be "much less common than the rest".  That is, what matters is not
> the frequency rank but the magnitude of the separation in frequency between
> the outliers and the nonoutliers.  But that's a much subtler notion than
> just "least common".


OK. Fair enough.  Although exactly how much separation in frequency makes
an outlier is pretty fuzzy, especially in large samples, where there are
likely to be few or no gaps at all.
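
To make the "separation, not rank" idea concrete, here is a rough sketch of
one way it could be coded.  Nothing here is an existing or proposed API: the
function name `rare_outliers`, the `min_ratio` threshold, and the rule that
outliers must be a minority of the items are all arbitrary assumptions for
illustration.

```python
from collections import Counter


def rare_outliers(counts, min_ratio=5.0):
    """Return items much less common than the rest, or [] if none.

    Hypothetical helper: sorts items by frequency, finds the largest ratio
    between adjacent frequencies, and treats everything below that gap as
    outliers only if the ratio is at least `min_ratio` (an arbitrary
    threshold) and the flagged items are a minority.
    """
    # Ignore non-positive counts so the ratios below are well defined.
    ranked = sorted(((k, v) for k, v in counts.items() if v > 0),
                    key=lambda kv: kv[1])  # least common first
    if len(ranked) < 2:
        return []
    # Ratio between each frequency and the next smaller one.
    ratios = [ranked[i + 1][1] / ranked[i][1] for i in range(len(ranked) - 1)]
    best = max(range(len(ratios)), key=ratios.__getitem__)
    if ratios[best] < min_ratio or best + 1 > len(ranked) // 2:
        return []
    return [item for item, _ in ranked[: best + 1]]


# Frequencies 1, 2, ..., 10: evenly spread, so nothing is flagged.
even = Counter({f"item{n}": n for n in range(1, 11)})
# One item seen once while the others are seen 50-60 times.
gapped = Counter({"a": 60, "b": 55, "c": 50, "rare": 1})

print(rare_outliers(even))
print(rare_outliers(gapped))
```

With the evenly spread counts nothing is returned, while the isolated item in
the second counter is.  The choice of `min_ratio` is exactly the fuzzy part:
in a large sample with no real gaps, any fixed threshold is a judgment call.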

It does tend to convince me that what we need is a more specialized class
in the `statistics` module.

In my large babynames dataset (a common one available from US Census), the
distribution on a linear scale is basically a vertical line followed by a
horizontal line.  It's much starker than a Zipf distribution.  On a semilog
scale, it has some choppiness in the tail, where you might call some names
"outliers", but it's not obvious what the cutoff would be.

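import matplotlib.pyplot as plt

# `names` is assumed to be a collections.Counter mapping name -> total count
x = range(len(names))
y = [count for _, count in names.most_common()]  # counts, most common first
plt.plot(x, y)
plt.title("Baby name frequencies USA 1880-2011")
plt.yscale("log")  # semilog plot: logarithmic y-axis
plt.show()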



-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.