Andrew McLean wrote: > In case anyone is interested, here is the latest.
> def insCost(tokenList, indx, pos): > """The cost of inserting a specific token at a specific normalised position along the sequence.""" > if containsNumber(tokenList[indx]): > return INSERT_TOKEN_WITH_NUMBER + POSITION_WEIGHT * (1 - pos) > elif indx > 0 and containsNumber(tokenList[indx-1]): > return INSERT_TOKEN_AFTER_NUMBER + POSITION_WEIGHT * (1 - pos) > elif tokenList[indx][0] in minorTokenList: > return INSERT_MINOR_TOKEN > else: > return INSERT_TOKEN + POSITION_WEIGHT * (1 - pos) > > def delCost(tokenList, indx, pos): > """The cost of deleting a specific token at a specific normalised position along the sequence. > This is exactly the same cost as inserting a token.""" > return insCost(tokenList, indx, pos) Functions are first-class citizens of Pythonia -- so just do this: delCost = insCost Re speed generally: (1) How many addresses in each list and how long is it taking? On what sort of configuration? (2) Have you considered using pysco -- if not running on x86 architecture, consider exporting your files to a grunty PC and doing the match there. (3) Have you considered some relatively fast filter to pre-qualify pairs of addresses before you pass the pair to your relatively slow routine? Soundex?? To put it bluntly, the _only_ problem to which soundex is the preferred solution is genealogy searching in the US census records, and even then one needs to know what varieties of the algorithm were in use at what times. I thought you said your addresses came from authoritative sources. You have phonetic errors? Can you give some examples of pairs of tokens that illustrate the problem you are trying to overcome with soundex? Back to speed again: When you look carefully at the dynamic programming algorithm for edit distance, you will note that it is _not_ necessary to instantiate the whole NxM matrix -- it only ever refers to the current row and the previous row. What does space saving have to do with speed, you ask? Well, Python is not FORTRAN; it takes considerable effort to evaluate d[i][j]. A relatively simple trick is to keep 2 rows and swap (the pointers to) them each time around the outer loop. At the expense of a little more complexity, one can reduce this to one row and 3 variables (north, northwest, and west) corresponding to d[i-1][j], d[i-1][j-1], and d[i][j-1] -- but I'd suggest the simple way first. Hope some of this helps, John -- http://mail.python.org/mailman/listinfo/python-list