Hi Terry,

Thanks for pointing it out. matplotlib's hist function wasn't broken after all :) I published the non-parametric statistics here: http://ysar.net/python/python-package-statistics-additions.html
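
For anyone who wants to try this on their own numbers, here is a minimal sketch of the kind of summary described in the quoted message below: median and inter-quartile range, plus the mean and sd of the log-transformed sizes, transformed back. It uses only the standard library. The function name `robust_summary` and the list `sizes_kb` are made up for illustration, and the sample values are invented, not the actual PyPI data.

```python
import math
import statistics


def robust_summary(sizes):
    """Summarize skewed, strictly positive data without relying on mean/sd of raw values."""
    # Non-parametric part: median and inter-quartile range.
    q1, _, q3 = statistics.quantiles(sizes, n=4)
    median = statistics.median(sizes)

    # Normalizing part: take logs, then mean and sd on the log scale.
    logs = [math.log(s) for s in sizes]
    log_mean = statistics.mean(logs)
    log_sd = statistics.stdev(logs)

    return {
        "median": median,
        "iqr": q3 - q1,
        # Back-transformed mean and mean +- sd (asymmetric on the original scale).
        "back_mean": math.exp(log_mean),
        "back_low": math.exp(log_mean - log_sd),
        "back_high": math.exp(log_mean + log_sd),
    }


# Hypothetical package sizes in KB, with one very large outlier.
sizes_kb = [4, 8, 12, 20, 35, 60, 150, 900, 36000]
summary = robust_summary(sizes_kb)

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

The back-transformed mean is the geometric mean of the sizes, so a single huge package pulls it up far less than it pulls up the ordinary mean. Whether log or square root is the better transform would still need checking against a plot of the real data.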

2013/10/18 Terry Reedy <tjre...@udel.edu>:
> On 10/18/2013 8:41 AM, Yaşar Arabacı wrote:
>>
>> Hi people,
>>
>> I collected some data on PyPI and published some statistics about
>> packages on PyPI. I think you might find it an interesting read:
>>
>> http://ysar.net/python/python-package-statistics.html
>
>
> "b2gpopulate (36MB)
> ...
> Total sizes on packages in PyPI amounted to 4.2 GB. Average package size is
> 161 KB and standard deviation is 1MB."
>
> For such highly skewed data, the mean and especially the standard deviation
> and confidence intervals are meaningless. They are 'parametric' statistics,
> which is to say, were designed for bell-shaped distributions. (I will not
> say 'normal' == Gaussian distributions because they are *not* normal for
> much raw data.)
>
> A better summary is obtained from either 'non-parametric' statistics
> (median, inter-quartile range) or from 'normalizing' the data (if possible).
> For the latter, try taking the square root or log of the sizes and plot the
> distribution. If either works, take the mean and sd of the transformed
> values. Then report those and also the transformed back mean and mean+-sd.
>
> --
> Terry Jan Reedy
>
>
> --
> https://mail.python.org/mailman/listinfo/python-list

--
http://ysar.net/