On Thursday, November 3, 2016 at 3:47:41 PM UTC-7, jlad...@itu.edu wrote:
> On Thursday, November 3, 2016 at 1:09:48 PM UTC-7, Neil D. Cerutti wrote:
> > you may also be 
> > able to use some items "off the shelf" from Python's difflib.
> 
> I wasn't aware of that module, thanks for the tip!
> 
> difflib.SequenceMatcher.ratio() returns a numerical value which represents 
> the "similarity" between two strings.  I don't see a precise definition of 
> "similar", but it may do what the OP needs.

Following up to myself... I just experimented with 
difflib.SequenceMatcher.ratio() and discovered something.  The result is not 
symmetric (loosely speaking, the algorithm is not "commutative").  That is, it 
doesn't ALWAYS produce the same ratio when the two strings are swapped.
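
A much smaller case shows the same effect.  The difflib documentation itself 
cautions that the result of ratio() may depend on the order of the arguments, 
and uses the pair 'tide' / 'diet' as its example.  The sketch below reuses that 
pair and also prints the matching blocks, since ratio() is documented as 
2.0*M/T (M = number of matched elements, T = total length of both sequences). 
The helper name ratio_and_matches is my own, just for illustration.

```python
from difflib import SequenceMatcher

def ratio_and_matches(a, b):
    """Return (ratio, matched substrings) for SequenceMatcher(None, a, b)."""
    m = SequenceMatcher(None, a, b)
    # get_matching_blocks() ends with a zero-length sentinel block; skip it.
    blocks = [a[blk.a:blk.a + blk.size]
              for blk in m.get_matching_blocks() if blk.size]
    return m.ratio(), blocks

forward = ratio_and_matches("tide", "diet")
backward = ratio_and_matches("diet", "tide")
print(forward)   # (0.25, ['t'])      -> M = 1, T = 8
print(backward)  # (0.5, ['d', 'e'])  -> M = 2, T = 8
```

Swapping the arguments changes which blocks get matched, and therefore M, 
which is enough to change the ratio.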

Here's an excerpt from my interpreter session.

==========

In [1]: from difflib import SequenceMatcher

In [2]: import numpy as np

In [3]: sim = np.zeros((4,4))


== snip ==


In [10]: strings
Out[10]: 
('Here is a string.',
 'Here is a slightly different string.',
 'This string should be significantly different from the other two?',
 "Let's look at all these string similarity values in a matrix.")

In [11]: for r, s1 in enumerate(strings):
   ....:     for c, s2 in enumerate(strings):
   ....:         m = SequenceMatcher(lambda x:x=="", s1, s2)
   ....:         sim[r,c] = m.ratio()
   ....:

In [12]: sim
Out[12]: 
array([[ 1.        ,  0.64150943,  0.2195122 ,  0.30769231],
       [ 0.64150943,  1.        ,  0.47524752,  0.30927835],
       [ 0.2195122 ,  0.45544554,  1.        ,  0.28571429],
       [ 0.30769231,  0.28865979,  0.33333333,  1.        ]])

==========

The values along the matrix diagonal, of course, are all ones, because each 
string was compared to itself.

I also expected the values reflected across the matrix diagonal to match.  The 
first row does in fact match the first column.  The other three pairs disagree 
(for example, 0.47524752 vs. 0.45544554 for the second and third strings).  The 
differences are not large, at most about 0.05, but they are there.  I don't 
know the reason.  Caveat programmer.
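
If an order-independent number is needed, one possible workaround is to compute 
the ratio in both directions and combine the two.  This is only a sketch: the 
helper symmetric_ratio is hypothetical, and averaging is my own choice rather 
than anything difflib provides (taking the max or min would work the same way).

```python
from difflib import SequenceMatcher

def symmetric_ratio(s1, s2):
    """Hypothetical helper: average ratio() over both argument orders."""
    forward = SequenceMatcher(None, s1, s2).ratio()
    backward = SequenceMatcher(None, s2, s1).ratio()
    # Averaging is an assumption made here, not a difflib feature.
    return (forward + backward) / 2

# The result no longer depends on which string comes first.
print(symmetric_ratio("tide", "diet") == symmetric_ratio("diet", "tide"))  # True
```

The cost is two SequenceMatcher runs per pair instead of one.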
