Re: Smarter algo, was Re: 03 digression by brute force

2018-12-15 Thread jfong
I appreciate your thoughtful analysis of this code. Before generalizing it to 
arbitrary additions, as Peter suggested :-), a recursive version is needed. I 
may give it a try this Sunday.
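
For what it's worth, here is one possible shape for such a recursive version. It is only a sketch under my own assumptions, not the code from the archive: the names solve, place and try_digit are made up for illustration, the result word is assumed to be at least as long as every addend, and leading letters are assumed to be non-zero.

```python
def solve(addends, total):
    """Return a letter -> digit mapping for sum(addends) == total, or None."""
    width = len(total)
    if any(len(word) > width for word in addends):
        return None
    rows = [word.rjust(width) for word in addends]   # one character per column
    leading = {word[0] for word in addends} | {total[0]}
    assigned, used = {}, set()

    def try_digit(letter, digit, next_step):
        """Tentatively bind letter to digit, recurse, then undo the binding."""
        if digit in used or (digit == 0 and letter in leading):
            return None
        assigned[letter] = digit
        used.add(digit)
        found = next_step()
        del assigned[letter]
        used.discard(digit)
        return found

    def place(col, row, column_sum):
        if col == width:                      # past the leftmost column
            return dict(assigned) if column_sum == 0 else None
        pos = width - 1 - col                 # work from right to left
        if row < len(rows):                   # still adding up this column
            letter = rows[row][pos]
            if letter == " ":
                return place(col, row + 1, column_sum)
            if letter in assigned:
                return place(col, row + 1, column_sum + assigned[letter])
            for digit in range(10):
                found = try_digit(
                    letter, digit,
                    lambda: place(col, row + 1, column_sum + digit))
                if found is not None:
                    return found
            return None
        # All addends placed: the result letter must be the column sum mod 10,
        # and the quotient is carried into the next column.
        carry, digit = divmod(column_sum, 10)
        letter = total[pos]
        if letter in assigned:
            return place(col + 1, 0, carry) if assigned[letter] == digit else None
        return try_digit(letter, digit, lambda: place(col + 1, 0, carry))

    return place(0, 0, 0)


solution = solve(["SEND", "MORE"], "MONEY")
print(solution)
```

Because a column is checked as soon as its letters are bound, most wrong guesses are dropped early, which is the same pruning idea as in the hand-written column functions.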


Avi Gross at 2018/12/15 UTC+8 AM8:13:37 wrote:
> REAL SUBJECT: Analysis of alternate algorithms.
> 
> Peter & Jach and anyone interested,
> 
> As Peter said in his altered subject line, Jach changed direction from 
> tweaking an algorithm to trying something quite different.
> 
> Reminder of the problem. 
> 
> Horizontal View:
> SEND + MORE = MONEY
> 
> Vertical View:
>    SEND
> +  MORE
> -------
>   MONEY
> 
> Hard to be precise as I am sticking to plain text in my message. The three 
> words above are meant to be right adjusted.
> 
> When solving it horizontally, Jach and I used variants of a brute-force 
> approach: substitute all possible combinations. He did it in-line and used 
> eval(). I did it by making lists of items on both sides, summing the int() 
> of each component, and comparing. We tried both algorithms and his was 
> slower; his profiling showed that the cause was eval(), which was much more 
> expensive, as was his use of regular expression functions. For comparison, 
> mine used int(), string manipulation functions, sets and dictionaries.
> 
> But the real puzzle is meant to be solved in a more vertical way by humans 
> using logic. I won't do that here but note we have 4 equations going down but 
> 8 unknowns. And the equations are not unique.
> 
> The rightmost column (I will call it the first as our arithmetic proceeds 
> from right to left) is meant to represent ONES and provides the equation:
> 
> (D+E == Y) or (D+E == Y+10)
> 
> Worse, for the next column, you either get a "1" carried from the previous 
> addition or not and either pass a "1" along to the next column or not. 4 
> Possibilities.
> 
> (N+R==E) or (N+R+1==E) or (N+R==E+10) or (N+R+1==E+10)
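
The four cases above collapse into a single check if the carry is treated as a number. A minimal sketch, assuming a helper I am calling column_ok (the name is made up here, it is not from the archived code):

```python
def column_ok(a, b, result, carry_in):
    """Return the carry out (0 or 1) if a + b + carry_in ends in result, else None."""
    total = a + b + carry_in
    if total % 10 != result:
        return None          # this column cannot add up with these digits
    return total // 10       # 0 means no carry, 1 means pass a carry leftward


# Units column of 9567 + 1085 = 10652: D=7, E=5, Y=2, no incoming carry.
print(column_ok(7, 5, 2, 0))
# Tens column: N=6, R=8, E=5, with the carry from the units column.
print(column_ok(6, 8, 5, 1))
```

Returning None for a mismatch keeps "no carry" (0) distinguishable from "column does not work".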
> 
> Getting a headache yet?
> 
> For a human, they need a way to come up with additional info in terms of 
> constraints.
> 
> There is a guiding inequality that looks like this:
> 
> S is not the same as any of the others. Anytime you solve for another, the 
> list of possible values for S shrinks.
> Ditto for each other variable.
> Or, since anything plus 0 is itself, neither D nor E can be 0: otherwise Y 
> would equal the other one, and there is no incoming carry in this column to 
> change that.
> 
> But getting a computer to do this might be a bit harder than blunt-force 
> searches. So let's see why Jach's new algorithm was faster.
> 
> The code I am analyzing can be viewed in the archives and will not be entered 
> again:
> 
> https://mail.python.org/pipermail/python-list/2018-December/738454.html
> 
> So what did Jach try in his newer version? It is what I call a vertical 
> approach, but one a computer can do more easily than a human can or would. I 
> see it is a very specific algorithm that hard-codes these variables (S, E, N, 
> D, M, O, R, Y) as global entities that are altered by a set of nested 
> functions. There are approaches that might be better, such as passing a 
> partially filled-out dictionary from one function to the next, since the only 
> function that prints the solution is the final one called.
> 
> So his is not a general solution.
> 
> What interests me as a probable reason this is faster is the number of 
> situations it tries. The earlier versions asked itertools.permutations() to 
> provide all unique arrangements of ten tokens in eight positions. So there 
> were 10 choices for the first and 9 for the second and so on, multiplying out 
> to 10!/2! or 1,814,400 different permutations tried. That approaches 2 million.
> 
> Jach broke the problem down into evaluating the units column with a loop like 
> this:
> 
> itertools.permutations(range(10), 3)
> 
> That is 720 possibilities. He then doubled that to 1,440 to consider a carry. 
> Only if the selected values for the three variables in contention (plus a 
> carry) satisfy the column equation does he go on to call the function that 
> evaluates the tens column.
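
To see how hard that first filter prunes, here is a small sketch. It is my own reconstruction of the units-column test, assuming the loop binds (D, E, Y) in that order and tries a carry of 0 and 1; the archived code may be arranged differently.

```python
from itertools import permutations


def units_candidates():
    """Yield (d, e, y, carry) combinations that satisfy D + E == Y + 10*carry."""
    for d, e, y in permutations(range(10), 3):
        for carry in (0, 1):
            if d + e == y + 10 * carry:
                yield d, e, y, carry


survivors = list(units_candidates())
print(len(survivors))   # 72 of the 1,440 (triple, carry) combinations remain
```

Only 72 combinations get through, which agrees with the 72 calls to the tens function in the counts reported further down, and none of them has D or E equal to 0.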
> 
> It then shrinks a bit more as he no longer gets the permutations of all 10 
> digits. He subtracts the three values that might work for the units, so he is 
> asking for permutations of 7 digits, two at a time. That is 42, doubled again 
> to 84 for carry purposes. And this function is not called 1,440 times, but 
> quite a bit fewer. 
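
The three counts quoted so far are easy to double-check. A minimal sketch, assuming a helper named arrangements (my name, used only for this illustration) that counts ordered selections of k items out of n:

```python
import math


def arrangements(n, k):
    """Number of ordered selections of k distinct items out of n: n! / (n-k)!"""
    return math.factorial(n) // math.factorial(n - k)


brute_force = arrangements(10, 8)   # all eight letters at once
units = arrangements(10, 3)         # D, E, Y in the units column
tens = arrangements(7, 2)           # two new letters from the 7 unused digits
print(brute_force, units, tens)     # 1814400 720 42
```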
> 
> So, similarly, of those 84 loops for tens, he only sometimes calls to 
> evaluate hundreds. As mentioned, the set of 10 digits shrinks some more and 
> this continues upward to functions that evaluate hundreds and thousands  and 
> finally the one evaluating ten thousands pretty much prints out an answer it 
> finds. 
> 
> So overall iterations can be shown to drop. We could add code to measure how 
> many times each function is called and come up with an exact value for this 
> built-in problem. I did and the functions were called this many times:
> 
> >>> counting
> {'unit': 1, 'ten': 72, 'hundred': 290, 'thou': 183, '10thou': 196}
> >>> sum(counting.values())
> 742
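
One way such counting could be added without touching the function bodies is a small decorator. This is only a sketch: the decorator name counted and the stand-in functions are hypothetical, and the real column functions would go where the placeholders are.

```python
from collections import Counter
from functools import wraps

counting = Counter()


def counted(name):
    """Decorator factory: bump counting[name] every time the function is called."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            counting[name] += 1
            return func(*args, **kwargs)
        return wrapper
    return decorator


@counted("unit")
def unit():
    # Placeholder standing in for the real units-column function.
    for _ in range(3):
        ten()


@counted("ten")
def ten():
    # Placeholder standing in for the real tens-column function.
    pass


unit()
print(dict(counting), sum(counting.values()))
```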
> 
> But I a

Re: Smarter algo, was Re: 03 digression by brute force

2018-12-15 Thread BlindAnagram
On 15/12/2018 09:56, jf...@ms4.hinet.net wrote:
> [clipped for brevity]

clusters of numbers

2018-12-15 Thread Marc Lucke

hey guys,

I have a hobby project that sorts my email automatically for me & I want 
to improve it.  There's data science and statistical info that I'm 
missing, & I always enjoy reading about the pythonic way to do things too.


I have a list of percentage scores:

(1,11,1,7,5,7,2,2,2,10,10,1,2,2,1,7,2,1,7,5,3,8,2,6,3,2,7,2,12,3,1,2,19,3,5,1,1,7,8,8,1,5,6,7,3,14,6,1,6,7,6,15,6,3,7,2,6,23,2,7,1,21,21,8,8,3,2,20,1,3,12,3,1,2,10,16,16,15,6,5,3,2,2,11,1,14,6,3,7,1,5,3,3,14,3,7,3,5,8,3,6,17,1,1,7,3,1,2,6,1,7,7,12,6,6,2,1,6,3,6,2,1,5,1,8,10,2,6,1,7,3,5,7,7,5,7,2,5,1,19,19,1,12,5,10,2,19,1,3,19,6,1,5,11,2,1,2,5,2,5,8,2,2,2,5,3,1,21,2,3,7,10,1,8,1,3,17,17,1,5,3,10,14,1,2,14,14,1,15,6,3,2,17,17,1,1,1,2,2,3,3,2,2,7,7,2,1,2,8,2,20,3,2,3,12,7,6,5,12,2,3,11,3,1,1,8,16,10,1,6,6,6,11,1,6,5,2,5,11,1,2,10,6,14,6,3,3,5,2,6,17,15,1,2,2,17,5,3,3,5,8,1,6,3,14,3,2,1,7,2,8,11,5,14,3,19,1,3,7,3,3,8,8,6,1,3,1,14,14,10,3,2,1,12,2,3,1,2,2,6,6,7,10,10,12,24,1,21,21,5,11,12,12,2,1,19,8,6,2,1,1,19,10,6,2,15,15,7,10,14,12,14,5,11,7,12,2,1,14,10,7,10,3,17,25,10,5,5,3,12,5,2,14,5,8,1,11,5,29,2,7,20,12,14,1,10,6,17,16,6,7,11,12,3,1,23,11,10,11,5,10,6,2,17,15,20,5,10,1,17,3,7,15,5,11,6,19,14,15,7,1,2,17,8,15,10,26,6,1,2,10,6,14,12,6,1,16,6,12,10,10,14,1,6,1,6,6,12,6,6,1,2,5,10,8,10,1,6,8,17,11,6,3,6,5,1,2,1,2,6,6,12,14,7,1,7,1,8,2,3,14,11,6,3,11,3,1,6,17,12,8,2,10,3,12,12,2,7,5,5,17,2,5,10,12,21,15,6,10,10,7,15,11,2,7,10,3,1,2,7,10,15,1,1,6,5,5,3,17,19,7,1,15,2,8,7,1,6,2,1,15,19,7,15,1,8,3,3,20,8,1,11,7,8,7,1,12,11,1,10,17,2,23,3,7,20,20,3,11,5,1,1,8,1,6,2,11,1,5,1,10,7,20,17,8,1,2,10,6,2,1,23,11,11,7,2,21,5,5,8,1,1,10,12,15,2,1,10,5,2,2,5,1,2,11,10,1,8,10,12,2,12,2,8,6,19,15,8,2,16,7,5,14,2,1,3,3,10,16,20,5,8,14,8,3,14,2,1,5,16,16,2,10,8,17,17,10,10,11,3,5,1,17,17,3,17,5,6,7,7,12,19,15,20,11,10,2,6,6,5,5,1,16,16,8,7,2,1,3,5,20,20,6,7,5,23,14,3,10,2,2,7,10,10,3,5,5,8,14,11,14,14,11,19,5,5,2,12,25,5,2,11,8,10,5,11,10,12,10,2,15,15,15,5,10,1,12,14,8,5,6,2,26,15,21,15,12,2,8,11,5,5,16,5,2,17,3,2,2,3,15,3,8,10,7,10,3,1,14,14,8,8,8,19,10,12,3,8,2,20,16,10,6,15,6,1,12,12,15,15,8,11,17,7,7,7,3,10,1,5,19,11,7,12,8,12,7,5,10,1,11,1,6,21,1,1,10,3,8,5,6,5,20,25,17,5,2,16,14,11,1,17,10,14,5,16,5,2,7,3,8,17,7,19,12,6,5,1,3,12,43,11,8,11,5,19,10,5,11,7,20,6,12,35,5,3,17,10,2
,12,6,5,21,24,15,5,10,3,15,1,12,6,3,17,3,2,3,5,5,14,11,8,1,8,10,5,25,8,7,2,6,3,11,1,11,7,3,10,7,12,10,8,6,1,1,17,3,1,1,2,19,6,10,2,2,7,5,16,3,2,11,10,7,10,21,3,5,2,21,3,14,6,7,2,24,3,17,3,21,8,5,11,17,5,6,10,5,20,1,12,2,3,20,6,11,12,14,6,6,1,14,15,12,15,6,20,7,7,19,3,7,5,16,12,6,7,2,10,3,2,11,8,6,6,5,1,11,1,15,21,14,6,3,2,2,5,6,1,3,5,3,6,20,1,15,12,2,3,3,7,1,16,5,24,10,7,1,12,16,8,26,16,15,10,19,11,6,6,5,6,5)

 & I'd like to know whether, & how, the numbers are clustered.  In 
an extreme & illustrative example, 1..10 would have zero clusters;  
1,1,1,2,2,2,7,7,7 would have 3 clusters (around 1, 2 & 7); 
17,22,20,45,47,51,82,84,83 would have 3 clusters (around 20, 47 & 
83).  In my set, when I scan it, I intuitively figure there's lots of 
numbers close to 0 & a lot close to 20 (or thereabouts).


I saw info about k-clusters but I'm not sure if I'm going down the right 
path.  I'm interested in k-clusters & will teach myself, but my priority 
is working out this problem.


Do you know the name of the algorithm I'm trying to use?  If so, are 
there python libraries like numpy that I can leverage?  I imagine that I 
could iterate from 0 to 100% using that as an artificial mean, discard 
values that are over a standard deviation away, and count the number of 
scores for that mean; then at the end of that I could set a threshold 
for which the artificial mean would be kept, something like this (scores 
being the list above):


means = {}
deviation = 5
threshold = int(0.25 * len(scores))
for i in range(101):  # candidate means, 0..100 percent
    # count the scores that lie within `deviation` of the candidate mean
    count = sum(1 for s in scores if abs(s - i) <= deviation)
    if count > threshold:
        means[i] = count

That algorithm is entirely untested & I think it could work, it's just I 
don't want to reinvent the wheel.  Any ideas kindly appreciated.



--
https://mail.python.org/mailman/listinfo/python-list


Re: clusters of numbers

2018-12-15 Thread Oscar Benjamin
On Sun, 16 Dec 2018 at 01:47, Marc Lucke  wrote:
>
> hey guys,
>
> I have a hobby project that sorts my email automatically for me & I want
> to improve it.  There's data science and statistical info that I'm
> missing, & I always enjoy reading about the pythonic way to do things too.
>
> I have a list of percentage scores:
>
> [clipped for brevity]
>
>   & I'd like to know whether, & how, the numbers are clustered.  In
> an extreme & illustrative example, 1..10 would have zero clusters;
> 1,1,1,2,2,2,7,7,7 would have 3 clusters (around 1, 2 & 7);
> 17,22,20,45,47,51,82,84,83 would have 3 clusters (around 20, 47 &
> 83).  In my set, when I scan it, I intuitively figure there's lots of
> numbers close to 0 & a lot close to 20 (or thereabouts).
>
> I saw info about k-clusters but I'm not sure if I'm going down the right
> path.  I'm interested in k-clusters & will teach myself, but my priority
> is working out this problem.
>
> Do you know the name of the algorithm I'm trying to use?

I don't recognise the algorithm in your code but when you say
k-clusters I assume you mean k-means:
https://en.wikipedia.org/wiki/K-means_clustering

Most discussions of k-means will assume that you are working in at
least 2 dimensions but in your case your data is 1D so bear that in
mind when comparing.

It's not hard to implement k-means yourself but I believe that
scikit-learn already has it:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
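
For 1D data a hand-rolled version is short. The following is a rough sketch of the usual assign-then-average loop, not scikit-learn's implementation; the function name kmeans_1d and the choice to pass the starting centres explicitly are my own assumptions for the illustration.

```python
def kmeans_1d(values, centers, max_iterations=100):
    """Plain k-means on a list of numbers, starting from the given centres."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each value joins the cluster of its nearest centre.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: each centre moves to the mean of its cluster
        # (an empty cluster keeps its old centre).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break                      # nothing moved, so we are done
        centers = new_centers
    return centers, clusters


centers, clusters = kmeans_1d([17, 22, 20, 45, 47, 51, 82, 84, 83], [0, 50, 100])
print(centers)
print(clusters)
```

On the second toy example from the question this settles on centres near 20, 47 and 83. Different starting centres can give a different answer, so it is worth trying a few.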

--
Oscar


Re: clusters of numbers

2018-12-15 Thread Vincent Davis
Why not start with a histogram.
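
A text histogram needs nothing beyond the standard library. A minimal sketch, assuming the scores are in a list and using a made-up function name text_histogram:

```python
from collections import Counter


def text_histogram(scores, width=50):
    """Return one line per integer value: value, count, and a bar of '#'."""
    counts = Counter(scores)
    peak = max(counts.values())
    lines = []
    for value in range(min(counts), max(counts) + 1):
        # Counter returns 0 for values that never occur, so gaps show up too.
        bar = "#" * round(width * counts[value] / peak)
        lines.append(f"{value:3d} {counts[value]:4d} {bar}")
    return "\n".join(lines)


print(text_histogram([1, 1, 1, 2, 2, 2, 7, 7, 7], width=3))
```

Values with a count of zero are printed as well, so gaps between clusters are visible at a glance.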

Vincent

On Sat, Dec 15, 2018 at 6:46 PM Marc Lucke  wrote:

> hey guys,
>
> I have a hobby project that sorts my email automatically for me & I want
> to improve it.  There's data science and statistical info that I'm
> missing, & I always enjoy reading about the pythonic way to do things too.
>
> I have a list of percentage scores:
>
>
> [clipped for brevity]
>
>   & I'd like to know whether, & how, the numbers are clustered.  In
> an extreme & illustrative example, 1..10 would have zero clusters;
> 1,1,1,2,2,2,7,7,7 would have 3 clusters (around 1, 2 & 7);
> 17,22,20,45,47,51,82,84,83 would have 3 clusters (around 20, 47 &
> 83).  In my set, when I scan it, I intuitively figure there's lots of
> numbers close to 0 & a lot close to 20 (or thereabouts).
>
> I saw info about k-clusters but I'm not sure if I'm going down the right
> path.  I'm interested in k-clusters & will teach myself, but my priority
> is working out this problem.
>
> Do you know the name of the algorithm I'm trying to use?  If so, are
> there python libraries like numpy that I can leverage?  I imagine that I
> could iterate from 0 to 100% using that as an artificial mean, discard
> values that are over a standard deviation away, and count the number of
> scores for that mean; then at the end of that I could set a threshold
> for which the artificial mean would be kept, something like this (scores
> being the list above):
>
> means = {}
> deviation = 5
> threshold = int(0.25 * len(scores))
> for i in range(101):  # candidate means, 0..100 percent
>     # count the scores that lie within `deviation` of the candidate mean
>     count = sum(1 for s in scores if abs(s - i) <= deviation)
>     if count > threshold:
>         means[i] = count
>
> That algorithm is entirely untested & I think it could work, it's just I
> don't want to reinvent the wheel.  Any ideas kindly appreciated.


RE: clusters of numbers

2018-12-15 Thread Avi Gross



-----Original Message-----
From: Avi Gross  
Sent: Saturday, December 15, 2018 11:27 PM
To: 'Marc Lucke' 
Subject: RE: clusters of numbers

Marc,

There are k-means implementations in Python, R and other places. Most uses 
have two or more dimensions. You specify how many clusters to look for, and 
the algorithm iterates: it starts with randomly chosen existing points, 
clusters things near those points, then re-clusters around the centers of 
those clusters until things stabilize.

Your data is 1-D. Something simpler like a bar chart makes sense. But that may 
not show underlying patterns.

I am more familiar with doing graphics in R but you can see a tabular view of 
your data:

data
score:   1   2   3   5   6   7   8  10  11  12  14  15
count: 124 116  97  95  89  74  57  73  48  49  38  35

score:  16  17  19  20  21  23  24  25  26  29  35  43
count:  20  33  21  19  14   5   4   4   3   1   1   1

There are clear gaps and a bar chart (which I cannot attach but could send in 
private email) does show clusters visibly.

But those may largely be an artifact of the missing info.

If you tell us more, we might be able to provide a better statistical answer. I 
assume you know how to get means and so on.

   vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 1021 7.82 6.01      6    7.12 5.93   1  43    42 1.04     1.23 0.19

Yes, the above is hard to read as I cannot use tables or a constant width font 
in this forum.
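
For the Python side, most of those summary numbers are available from the standard library. A minimal sketch, assuming a helper I am calling describe (it only mimics a few columns of the R output above and is not the same function):

```python
import math
import statistics


def describe(scores):
    """Return a few summary statistics for a list of numbers."""
    n = len(scores)
    sd = statistics.stdev(scores)          # sample standard deviation
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "sd": sd,
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "se": sd / math.sqrt(n),           # standard error of the mean
    }


summary = describe([1, 1, 1, 2, 2, 2, 7, 7, 7])
print(summary)
```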

I ran a kmeans asking for 3 clusters:

1 16.512097
2  1.919881
3  7.433486

The three clusters had these scores in them:

Cluster 1: 12 14 15 16 17 19 20 21 23 24 25 26 29 35 43
Cluster 2: 1 2 3
Cluster 3: 5 6 7 8 10 11

If I run it asking for say 5 clusters:

Centers:

1  6.295238
2 11.432692
3  1.48
4  3.00
5 18.478261

And here are your five clusters:

5 6 7 8
10 11 12 14
1 2
3
15 16 17 19 20 21 23 24 25 26 29 35 43

If you ran this for various numbers, you might see one that makes more sense to 
you.  Or, maybe not.

We could tell you what functions to use, but if you search using keywords like 
python (or another language) followed by k-means or kmeans you can find out what 
to install and use. In Python, you would need NumPy and probably SciPy as well 
as scikit-learn, which provides the KMeans class in sklearn.cluster. Note you 
can fine-tune the algorithm in multiple ways or run it several times, as the 
results can depend on the initial guesses. And you may want to be able to make 
graphics showing the clusters, even though the data is 1-D.

Good luck.


-----Original Message-----
From: Python-list  On 
Behalf Of Marc Lucke
Sent: Saturday, December 15, 2018 7:55 PM
To: python-list@python.org
Subject: clusters of numbers

hey guys,

I have a hobby project that sorts my email automatically for me & I want to 
improve it.  There's data science and statistical info that I'm missing, & I 
always enjoy reading about the pythonic way to do things too.

I have a list of percentage scores:

[clipped for brevity]

Re: clusters of numbers

2018-12-15 Thread Shakti Kumar
On Sun, 16 Dec 2018 at 09:49, Vincent Davis  wrote:
>
> Why not start with a histogram.
>
> Vincent
>
> On Sat, Dec 15, 2018 at 6:46 PM Marc Lucke  wrote:
>
> > hey guys,
> >
> > I have a hobby project that sorts my email automatically for me & I want
> > to improve it.  There's data science and statistical info that I'm
> > missing, & I always enjoy reading about the pythonic way to do things too.
> >
> > I have a list of percentage scores:

[clipped for brevity]

> > That algorithm is entirely untested & I think it could work, it's just I
> > don't want to reinvent the wheel.  Any ideas kindly appreciated.

+1 for k-means, certainly.
Also, k-means in 1-D amounts to a simple distance comparison and
assignment. A quick Google search will give you the exact code for doing so.
You yourself will have to decide how many clusters you want, as
Avi has rightly pointed out.
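
To illustrate the "distance comparison and assignment" part, here is a rough sketch. The function name assign_to_nearest is made up for this example; the three centres are the ones Avi reported, and the values are the distinct scores from his table.

```python
def assign_to_nearest(values, centers):
    """Group each value under the centre it is closest to."""
    groups = {center: [] for center in centers}
    for value in values:
        nearest = min(centers, key=lambda center: abs(value - center))
        groups[nearest].append(value)
    return groups


distinct_scores = [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17, 19,
                   20, 21, 23, 24, 25, 26, 29, 35, 43]
groups = assign_to_nearest(distinct_scores, [16.512097, 1.919881, 7.433486])
for center, members in groups.items():
    print(center, members)
```

Full k-means just repeats this step, recomputing each centre as the mean of its group, until the groups stop changing.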


--
/Shakti