hey guys,

I have a hobby project that sorts my email automatically for me & I want to improve it.  There's data science and statistical info that I'm missing, & I always enjoy reading about the pythonic way to do things too.

I have a list of percentage scores:

(1,11,1,7,5,7,2,2,2,10,10,1,2,2,1,7,2,1,7,5,3,8,2,6,3,2,7,2,12,3,1,2,19,3,5,1,1,7,8,8,1,5,6,7,3,14,6,1,6,7,6,15,6,3,7,2,6,23,2,7,1,21,21,8,8,3,2,20,1,3,12,3,1,2,10,16,16,15,6,5,3,2,2,11,1,14,6,3,7,1,5,3,3,14,3,7,3,5,8,3,6,17,1,1,7,3,1,2,6,1,7,7,12,6,6,2,1,6,3,6,2,1,5,1,8,10,2,6,1,7,3,5,7,7,5,7,2,5,1,19,19,1,12,5,10,2,19,1,3,19,6,1,5,11,2,1,2,5,2,5,8,2,2,2,5,3,1,21,2,3,7,10,1,8,1,3,17,17,1,5,3,10,14,1,2,14,14,1,15,6,3,2,17,17,1,1,1,2,2,3,3,2,2,7,7,2,1,2,8,2,20,3,2,3,12,7,6,5,12,2,3,11,3,1,1,8,16,10,1,6,6,6,11,1,6,5,2,5,11,1,2,10,6,14,6,3,3,5,2,6,17,15,1,2,2,17,5,3,3,5,8,1,6,3,14,3,2,1,7,2,8,11,5,14,3,19,1,3,7,3,3,8,8,6,1,3,1,14,14,10,3,2,1,12,2,3,1,2,2,6,6,7,10,10,12,24,1,21,21,5,11,12,12,2,1,19,8,6,2,1,1,19,10,6,2,15,15,7,10,14,12,14,5,11,7,12,2,1,14,10,7,10,3,17,25,10,5,5,3,12,5,2,14,5,8,1,11,5,29,2,7,20,12,14,1,10,6,17,16,6,7,11,12,3,1,23,11,10,11,5,10,6,2,17,15,20,5,10,1,17,3,7,15,5,11,6,19,14,15,7,1,2,17,8,15,10,26,6,1,2,10,6,14,12,6,1,16,6,12,10,10,14,1,6,1,6,6,12,6,6,1,2,5,10,8,10,1,6,8,17,11,6,3,6,5,1,2,1,2,6,6,12,14,7,1,7,1,8,2,3,14,11,6,3,11,3,1,6,17,12,8,2,10,3,12,12,2,7,5,5,17,2,5,10,12,21,15,6,10,10,7,15,11,2,7,10,3,1,2,7,10,15,1,1,6,5,5,3,17,19,7,1,15,2,8,7,1,6,2,1,15,19,7,15,1,8,3,3,20,8,1,11,7,8,7,1,12,11,1,10,17,2,23,3,7,20,20,3,11,5,1,1,8,1,6,2,11,1,5,1,10,7,20,17,8,1,2,10,6,2,1,23,11,11,7,2,21,5,5,8,1,1,10,12,15,2,1,10,5,2,2,5,1,2,11,10,1,8,10,12,2,12,2,8,6,19,15,8,2,16,7,5,14,2,1,3,3,10,16,20,5,8,14,8,3,14,2,1,5,16,16,2,10,8,17,17,10,10,11,3,5,1,17,17,3,17,5,6,7,7,12,19,15,20,11,10,2,6,6,5,5,1,16,16,8,7,2,1,3,5,20,20,6,7,5,23,14,3,10,2,2,7,10,10,3,5,5,8,14,11,14,14,11,19,5,5,2,12,25,5,2,11,8,10,5,11,10,12,10,2,15,15,15,5,10,1,12,14,8,5,6,2,26,15,21,15,12,2,8,11,5,5,16,5,2,17,3,2,2,3,15,3,8,10,7,10,3,1,14,14,8,8,8,19,10,12,3,8,2,20,16,10,6,15,6,1,12,12,15,15,8,11,17,7,7,7,3,10,1,5,19,11,7,12,8,12,7,5,10,1,11,1,6,21,1,1,10,3,8,5,6,5,20,25,17,5,2,16,14,11,1,17,10,14,5,16,5,2,7,3,8,17,7,19,12,6,5,1,3,12,43,11,8,11,5,19,10,5,11,7,20,6,12,35,5,3,17,10,2,12,6,5,21,24,15,5,10,3,15,1,12,6,3,17,3,2,3,5,5,14,11,8,1,8,10,5,25,8,7,2,6,3,11,1,11,7,3,10,7,12,10,8,6,1,1,17,3,1,1,2,19,6,10,2,2,7,5,16,3,2,11,10,7,10,21,3,5,2,21,3,14,6,7,2,24,3,17,3,21,8,5,11,17,5,6,10,5,20,1,12,2,3,20,6,11,12,14,6,6,1,14,15,12,15,6,20,7,7,19,3,7,5,16,12,6,7,2,10,3,2,11,8,6,6,5,1,11,1,15,21,14,6,3,2,2,5,6,1,3,5,3,6,20,1,15,12,2,3,3,7,1,16,5,24,10,7,1,12,16,8,26,16,15,10,19,11,6,6,5,6,5)

 & I'd like to know know whether, & how the numbers are clustered.  In an extreme & illustrative example, 1..10 would have zero clusters;  1,1,1,2,2,2,7,7,7 would have 3 clusters (around 1,2 & 7); 17,22,20,45,47,51,82,84,83  would have 3 clusters. (around 20, 47 & 83).  In my set, when I scan it, I intuitively figure there's lots of numbers close to 0 & a lot close to 20 (or there abouts).

I saw info about k-clusters but I'm not sure if I'm going down the right path.  I'm interested in k-clusters & will teach myself, but my priority is working out this problem.

Do you know the name of the algorithm I'm trying to use?  If so, are there python libraries like numpy that I can leverage?  I imagine that I could iterate from 0 to 100% using that as an artificial mean, discard values that are over a standard deviation away, and count the number of scores for that mean; then at the end of that I could set a threshold for which the artificial mean would be kept something like (no attempt at correct syntax:

means={}
deviation=5
threshold=int(0.25*len(list))
for i in range 100:
  count=0
  for j in list:
    if abs(j-i) > deviation:
      count+=1
  if count > threshold:
    means[i]=count

That algorithm is entirely untested & I think it could work, it's just I don't want to reinvent the wheel.  Any ideas kindly appreciated.


--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to