On 29/06/2006 10:52 AM, John Machin wrote: > On 29/06/2006 10:07 AM, BBands wrote: >> On 6/28/06, John Machin <[EMAIL PROTECTED]> wrote: >>> On 29/06/2006 9:28 AM, BBands wrote: >>> > I'd like to see if a string exists, even approximately, in another. >>> For >>> > example if "black" exists in "blakbird" or if "beatles" exists in >>> > "beatlemania". The application is to look though a long list of songs >>> > and return any approximate matches along with a confidence factor. I >>> > have looked at edit distance, but that isn't a good choice for finding >>> > a short string in a longer one. >>> >>> There is a trivial difference between the traditional >>> distance-matrix-based Levenshtein algorithm for edit distance and the >>> corresponding one for approximate string searching. Ditto between >>> finite-state-machine approaches. Ditto between modern bit-bashing >>> approaches.
The trivial difference is that in the searching case one edge of the dynamic programming matrix is initialised (in theory) to [0] * (len(text)+1) whereas in the distance case it is set to range(len(text)+1). >>> >>> > I have also explored >>> > difflib.SequenceMatcher and .get_close_matches, but what I'd really >>> > like is something like: >>> > >>> > a = FindApprox("beatles", "beatlemania") >>> > print a >>> > 0.857 >>> > >>> > Any ideas? >>> >>> You got no ideas from googling "approximate string search python"??? >> >> Yes, many including agrepy and soundex in addition to those I >> mentioned already, but none seem really handy at approximately looking >> up smaller strings in larger ones. I also note that this has been the >> topic of prior discussion without resolutiuon. >> >> jab > > It helps if you tell all that you've done. Otherwise people will tell > you to do what you've done already :-) > > It helps if you reply on-list so that others can see. You may get better > answers sooner. I have to vanish now. Will check back tonight. > Here's a possibly better answer: def approx_search(pattern, text, max_dist): """ Return a generator which yields tuples (endpos, dist) for each possible endpos such that (1) Levenshtein_distance(pattern, text[startpos:endpos]) = dist (2) 0 <= dist <= max_dist for some (undetermined) startpos. Not restricted to strings: (a) pattern must support pattern[i] and len(pattern). (b) text needs only to support enumerate(text). (c) contents of pattern and text need only support the != operator. Example: list(approx_search('Beatles', 'beetles Beat leverage Betelguese', 3)) -> [(6, 3), (7, 2), (8, 3), (12, 3), (13, 3), (14, 3), (15, 2), (16, 2), (17, 3), (26, 3), (27, 3)] """ plen = len(pattern) prange = range(plen) dprev = range(plen+1) dcurr = [0] * (plen+1) dist = plen for tx, tc in enumerate(text): # dcurr[0] = 0 # not needed, never changes for px in prange: dist = dprev[px] + (tc != pattern[px]) temp = dcurr[px] + 1 if temp < dist: dist = temp temp = dprev[px+1] + 1 if temp < dist: dist = temp dcurr[px+1] = dist if dist <= max_dist: yield tx+1, dist dprev, dcurr = dcurr, dprev If you need/want to poke around discovering "startpos", then you need something like the following, which is similar to the function by Magnus Lie Hetland which you'll find on the web, but appears to be about twice as fast without destroying legibility completely :-) def Levenshtein_SJM(aseq, bseq): """ Calculate Levenshtein distance between aseq and bseq. Space O(min(m,n)) Time O(mn) """ alen = len(aseq) blen = len(bseq) if alen > blen: aseq, bseq = bseq, aseq alen, blen = blen, alen if not alen: return blen arange = range(alen) dprev = range(alen+1) dcurr = [0] * (alen+1) for bx in xrange(blen): bc = bseq[bx] dcurr[0] = bx + 1 for ax in arange: dist = dprev[ax] + (bc != aseq[ax]) temp = dcurr[ax] + 1 if temp < dist: dist = temp temp = dprev[ax+1] + 1 if temp < dist: dist = temp dcurr[ax+1] = dist dprev, dcurr = dcurr, dprev return dist Note that both of the above functions use the ancient original algorithm, which takes O(mn) time. For heavy lifting a modern algorithm and use of Pyrex or C are indicated. HTH, John -- http://mail.python.org/mailman/listinfo/python-list