Jaro-Winkler algorithm: Measuring similarity between strings
Based on examples and formulas from http://en.wikipedia.org/wiki/Jaro-Winkler.

Useful for measuring similarity between two strings. For example, if you want to detect that the user made a typo.

def jarow(s1,s2):
    """
    Returns a number between 1 and 0, where 1 is the most similar

    example:

    print jarow("martha","marhta")
    """
    m= jarow_m(s1,s2)
    t1 = jarow_t(s1,s2)
    t2 = jarow_t(s2,s1)
    t = float(t1)/float(t2)

    d = 0.1
    # this is the jaro-distance
    d_j = 1.0/3.0 * ((m/len(s1)) + (m/len(s2)) + ((m - t)/float(m)))

    # if the strings are prefixed similar, they are weighted more heavily
    l = winkler_l(s1,s2)
    print l
    return d_j + (l * 0.1 * (1 - d_j))

def winkler_l(s1,s2):
    """ Number of the four first characters matching """
    l = 0
    counter = 0
    for s_j,s_i in zip(s1,s2):
        if s_j == s_i:
            l += 1
        counter += 1
        if counter > 4:
            break
    return l

def jarow_m(s1,s2):
    """ Number of matching characters """
    m = 0
    d = {}
    for s in s1:
        d[s] = True
    for s in s2:
        if d.has_key(s):
            m += 1
    return m

def jarow_t(s1,s2):
    """ Number of transpositions """
    t= 0
    pos ={}
    counter = 0
    for s in s1:
        pos[s] = counter
        counter += 1
    counter = 0
    for s in s2:
        if pos.has_key(s):
            if pos[s] != counter:
                t += 1
        counter += 1
    return t

--
http://mail.python.org/mailman/listinfo/python-list
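For reference, here is how the docstring example works out by hand under the standard definition from the Wikipedia article. This is only the arithmetic, not the posted functions; the values of m, t and l are counted manually for "martha" and "marhta", and the scaling factor p = 0.1 is the commonly used default (an assumption here).

```python
# Hand-counted inputs for "martha" vs "marhta" (assumed, standard definition).
m = 6.0      # all six characters match within the allowed window
t = 1.0      # "th" vs "ht": two out-of-order matches, halved
len1 = 6.0   # len("martha")
len2 = 6.0   # len("marhta")
l = 3        # length of the common prefix "mar"
p = 0.1      # prefix scaling factor (common default)

# Jaro similarity, then the Winkler prefix adjustment.
d_j = (m / len1 + m / len2 + (m - t) / m) / 3.0
d_w = d_j + l * p * (1.0 - d_j)

print(round(d_j, 4))  # 0.9444
print(round(d_w, 4))  # 0.9611
```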
Re: Jaro-Winkler algorithm: Measuring similarity between strings
Thanks for the useful comments.

On 20 Dec, 01:38, John Machin wrote:
> On Dec 20, 10:02 am, Øyvind wrote:
>
> > Based on examples and formulas
> > from http://en.wikipedia.org/wiki/Jaro-Winkler.
>
> For another Python implementation, google "febrl".
>
> > Useful for measuring similarity between two strings. For example if
> > you want to detect that the user did a typo.
>
> You mean like comparing the user's input word with some collection of
> valid words? You would need to be using something else as a
> quick-and-dirty filter ... Jaro-Winkler is relatively slow.

Do you have any alternative suggestions?

> > def jarow(s1,s2):
> >     """ Returns a number between 1 and 0, where 1 is the most similar
> >     example:
> >     print jarow("martha","marhta")
> >     """
> >     m= jarow_m(s1,s2)
> >     t1 = jarow_t(s1,s2)
> >     t2 = jarow_t(s2,s1)
> >     t = float(t1)/float(t2)
>
> Huh? t1 and t2 are supposed to be counts of transpositions. So is t.
> So how come t is a ratio of t1 to t2?? BTW, suppose t2 is zero.

Good point. There should be only one jarow_t.

> Also as the Wikipedia article says,
> it's not a metric. I.e. it doesn't satisfy dist(a, c) <= dist(a, b) +
> dist(b, c).

It's not a mathematically correct metric, but it is a useful heuristic.

> The above code is not symmetrical; jarow_m(s1, s2) does not
> necessarily equal jarow_m(s2, s1). The article talks about one "m",
> not two of them.

Hmm.. also a good point. I will make it count all the matches, not just the matches in s1.
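A minimal sketch of how the points raised above could be addressed: a single match count m, a single transposition count t (so no division by a second count), and a prefix bonus capped at four characters. It follows the usual textbook definition rather than the posted functions, and the names jaro, jaro_winkler, p and max_prefix are chosen here for illustration.

```python
from __future__ import division  # true division on Python 2 as well


def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two characters match only if equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Compare the matched characters in order; each out-of-order pair
    # counts as half a transposition.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    d_j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        l += 1
    return d_j + l * p * (1 - d_j)


print(round(jaro_winkler("martha", "marhta"), 4))  # 0.9611
```

Note that the prefix loop stops at the first mismatch, whereas winkler_l in the original post counts every equal position among the first characters.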
urllib.urlopen() with pages that require cookies.
I am trying to visit a limited number of web pages that require cookies. I will get redirected if my application does not handle them. I am using urllib.urlopen() to visit the pages right now, and I need a push in the right direction to find out how to deal with pages that require cookies.

Does anyone have any idea how to go about this?

Thanks in advance.

- ØØ -
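One possible direction, sketched with the Python 3 standard library: an opener with a cookie handler keeps cookies between requests, which plain urlopen() does not. On the Python 2 of this thread the same classes live in cookielib and urllib2 (cookielib.CookieJar, urllib2.HTTPCookieProcessor, urllib2.build_opener). The helper name build_cookie_opener and the URL in the comment are hypothetical.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(jar)
    opener = urllib.request.build_opener(handler)
    return opener, jar


opener, jar = build_cookie_opener()

# Every request made through the same opener shares the jar, so a cookie
# set by one response (including during a redirect) is sent with the next.
# response = opener.open("http://example.com/needs-cookies")  # hypothetical URL
# html = response.read()
```

After a request, iterating over jar shows which cookies the site set, which can help when working out why a redirect still happens.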