On Wed, Mar 15, 2017 at 11:06:20AM -0700, David Mertz wrote:
> On Wed, Mar 15, 2017 at 10:39 AM, Steven D'Aprano <[email protected]>
> wrote:
>
> > > But I can imagine an occasional need to, e.g. "find outliers." However,
> > > that is not hard to spell as `mycounter.most_common()[-1*N:]`. Or if your
> > > program does this often, write a utility function `find_outliers(...)`
> >
> > That's not how you find outliers :-)
> > Just because a data point is uncommon doesn't mean it is an outlier.
>
> That's kinda *by definition* what an outlier is in categorical data!

[...]

> This isn't exactly statistics, but it's like your product example. There
> are infinitely many random strings that occurred zero times among US
> births. But a "rare name" is one that occurred at least once, not one of
> these zero-occurring possible strings.

I'm not sure that "outlier" is defined for non-numeric data, or at least
not formally defined. You'd need a definition of central location (which
would be the mode) and a definition of spread, and I'm not sure how you
would measure spread for categorical data. What's the spread of this
data?

    ["Jack", "Jack", "Jill", "Jack"]

The mode is clearly "Jack", but beyond that I'm not sure what can be said
except to give the frequencies themselves.

One commonly used definition of outlier (due to John Tukey) is:

- sort your data and divide it into four equal quarters;
- the points between each quarter are known as quartiles, and there
  are three of them: Q1, Q2 (the median), Q3;
- define the Interquartile Range IQR = Q3 - Q1;
- define lower and upper fences as Q1 - 1.5*IQR and Q3 + 1.5*IQR;
- anything not between the lower and upper fences is an outlier.

Or to be precise, a *suspected* outlier, since for very long tailed
distributions, rare values are to be expected and should not be
discarded without good reason.

If your data is Gaussian, that corresponds to discarding roughly 0.7% of
the values, namely the most extreme ones.

> I realize from my example, however, that I'm probably more interested in
> the actual uncommonality, not the specific `.least_common()`. I.e. I'd
> like to know which names occurred fewer than 10 times... but I don't know
> how many items that will include. Or as a percentage, which names occur in
> fewer than 0.01% of births?

Indeed. While the frequencies themselves are useful, a
least_common(count) method (by analogy with Counter.most_common) is not
so useful.

-- 
Steve
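
P.S. A minimal sketch of the two ideas above, in case it helps: Tukey's
fences for numeric data, and a frequency threshold for a Counter. The
function names (tukey_fences, suspected_outliers, rarer_than,
rarer_than_fraction) are made up for illustration and are not part of
the standard library. The sketch assumes Python 3.8 or later for
statistics.quantiles, and uses its default quartile method, which is only
one of several conventions for computing quartiles.

```python
from collections import Counter
from statistics import quantiles


def tukey_fences(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, _median, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def suspected_outliers(data, k=1.5):
    """Return the values lying outside the Tukey fences."""
    lower, upper = tukey_fences(data, k)
    return [x for x in data if x < lower or x > upper]


def rarer_than(counter, threshold):
    """Items seen at least once but fewer than `threshold` times."""
    return {item: n for item, n in counter.items() if 0 < n < threshold}


def rarer_than_fraction(counter, fraction):
    """Items seen at least once but in less than `fraction` of all observations."""
    total = sum(counter.values())
    return {item: n for item, n in counter.items() if 0 < n / total < fraction}


if __name__ == "__main__":
    print(suspected_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
    names = Counter(["Jack", "Jack", "Jill", "Jack"])
    print(rarer_than(names, 2))
    print(rarer_than_fraction(names, 0.5))
```

Note that the Counter helpers take a threshold rather than a count of
items to return, since you generally don't know in advance how many
items fall below the threshold.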
