Jarow-Winkler algorithm: Measuring similarity between strings

2008-12-19 Thread Øyvind
Based on examples and formulas from http://en.wikipedia.org/wiki/Jaro-Winkler.
Useful for measuring similarity between two strings, for example if
you want to detect that the user made a typo.



def jarow(s1,s2):
    """  Returns a number between 1 and 0, where 1 is the most similar

        example:

        print jarow("martha","marhta")

        """
    m= jarow_m(s1,s2)
    t1 = jarow_t(s1,s2)
    t2 = jarow_t(s2,s1)
    t = float(t1)/float(t2)

    d = 0.1

    # this is the jaro-distance
    d_j = 1.0/3.0 * ((m/len(s1)) + (m/len(s2)) + ((m - t)/float(m)))

    # if the strings are prefixed similar, they are weighted more heavily
    l = winkler_l(s1,s2)
    print l
    return d_j + (l * 0.1 * (1 - d_j))

def winkler_l(s1,s2):
    """ Number of the four first characters matching """

    l = 0
    counter = 0
    for s_j,s_i in zip(s1,s2):
        if s_j == s_i:
            l += 1
        counter += 1

        if counter > 4:
            break

    return l



def jarow_m(s1,s2):
    """ Number of matching characters """
    m = 0
    d = {}
    for s in s1:
        d[s] = True

    for s in s2:
        if d.has_key(s):
            m += 1
    return m


def jarow_t(s1,s2):
    """
    Number of transpositions

    """
    t= 0
    pos ={}
    counter = 0
    for s in s1:
        pos[s] = counter
        counter += 1
    counter = 0
    for s in s2:
        if pos.has_key(s):
            if pos[s] != counter:
                t += 1
        counter += 1

    return t
--
http://mail.python.org/mailman/listinfo/python-list


Re: Jarow-Winkler algorithm: Measuring similarity between strings

2008-12-20 Thread Øyvind
Thanks for the useful comments.


On 20 Dec, 01:38, John Machin wrote:
> On Dec 20, 10:02 am, Øyvind wrote:
>
> > Based on examples and formulas from
> > http://en.wikipedia.org/wiki/Jaro-Winkler.
>
> For another Python implementation, google "febrl".
>
> > Useful for measuring similarity between two strings. For example if
> > you want to detect that the user did a typo.
>
> You mean like comparing the user's input word with some collection of
> valid words? You would need to be using something else as a quick-and-
> dirty filter ... Jaro-Winkler is relatively slow.


Do you have any alternative suggestions?

>
>
>
> > def jarow(s1,s2):
>
> >     """  Returns a number between 1 and 0, where 1 is the most similar
>
> >         example:
>
> >         print jarow("martha","marhta")
>
> >         """
> >     m= jarow_m(s1,s2)
> >     t1 = jarow_t(s1,s2)
> >     t2 = jarow_t(s2,s1)
> >     t = float(t1)/float(t2)
>
> Huh? t1 and t2 are supposed to be counts of transpositions. So is t.
> So how come t is a ratio of t1 to t2?? BTW, suppose t2 is zero.

Good point. There should be only one jarow_t.

>
>  Also as the Wikipedia article says,
> it's not a metric. I.e. it doesn't satisfy dist(a, c) <= dist(a, b) +
> dist(b, c).

It's not a metric in the strict mathematical sense, but it is a useful
heuristic measure.

>
> The above code is not symmetrical; jarow_m(s1, s2) does not
> necessarily equal jarow_m(s2, s1). The article talks about one "m",
> not two of them.
>


Hmm.. also a good point. I will make it count all the matches, not
just the matches in s1.
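
A rough sketch of how the two fixes discussed above (a single match count m and a single transposition count t) could look. This is not the code from the original post; it follows the usual textbook description of Jaro-Winkler, is written for Python 3, and the names jaro, jaro_winkler, p and max_prefix are my own assumptions.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two characters match only if they are equal and no farther apart
    # than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2

    m = 0  # one match count, shared by both strings
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Compare the matched characters in order; half the number of
    # positions that disagree is the transposition count t.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    d_j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return d_j + prefix * p * (1 - d_j)
```

For "MARTHA" and "MARHTA" this gives m = 6 and t = 1, so the Jaro value is (1 + 1 + 5/6) / 3, about 0.944, and with the three-character common prefix "MAR" the Jaro-Winkler value is about 0.961.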





urllib.urlopen() with pages that require cookies.

2006-04-27 Thread Øyvind Østlund
I am trying to visit a limited number of web pages that require
cookies. I will get redirected if my application does not handle them.
I am using urllib.urlopen() to visit the pages right now, and I need a
push in the right direction to find out how to deal with pages that
require cookies. Does anyone have an idea how to go about this?



Thanks in advance.
- ØØ -
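
One possible direction, as a minimal sketch rather than a tested answer for the pages in question: the standard library can keep cookies between requests if the opener is built with a cookie processor. The code below is Python 3 (urllib.request and http.cookiejar); in the Python 2 of this post the corresponding modules were urllib2 and cookielib. The helper names make_cookie_opener and fetch_with_cookies are made up for illustration.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch_with_cookies(opener, url):
    """Fetch url through the opener; cookies set by the server land in
    the opener's jar and are sent back on later requests."""
    with opener.open(url) as response:
        return response.read()
```

Calling fetch_with_cookies(opener, url) repeatedly with the same opener reuses whatever cookies earlier responses set, which is usually what a page that redirects cookie-less clients expects.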