In article <[EMAIL PROTECTED]>, John Machin <[EMAIL PROTECTED]> writes
Andrew McLean wrote:
In case anyone is interested, here is the latest.

def insCost(tokenList, indx, pos):
    """The cost of inserting a specific token at a specific
normalised position along the sequence."""
    if containsNumber(tokenList[indx]):
        return INSERT_TOKEN_WITH_NUMBER + POSITION_WEIGHT * (1 - pos)
    elif indx > 0 and containsNumber(tokenList[indx-1]):
        return INSERT_TOKEN_AFTER_NUMBER + POSITION_WEIGHT * (1 - pos)
    elif tokenList[indx][0] in minorTokenList:
        return INSERT_MINOR_TOKEN
    else:
        return INSERT_TOKEN + POSITION_WEIGHT * (1 - pos)

def delCost(tokenList, indx, pos):
    """The cost of deleting a specific token at a specific normalised
position along the sequence.
    This is exactly the same cost as inserting a token."""
    return insCost(tokenList, indx, pos)

Functions are first-class citizens of Pythonia -- so just do this:

delCost = insCost


Actually, the code used to look like that. I changed it because I thought the expanded version was clearer, but perhaps that was a bad idea.


Re speed generally: (1) How many addresses in each list and how long is
it taking? On what sort of configuration? (2) Have you considered using
Psyco -- if not running on x86 architecture, consider exporting your
files to a grunty PC and doing the match there. (3) Have you considered
some relatively fast filter to pre-qualify pairs of addresses before
you pass the pair to your relatively slow routine?
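For example, the pre-qualifying filter could be as cheap as a token-overlap test;
the function name and the threshold below are purely illustrative:

def cheap_prefilter(tokens_a, tokens_b, min_overlap=0.3):
    """Quick Jaccard-style overlap test between two token lists; only
    pairs that pass go on to the expensive edit-distance comparison.
    The 0.3 threshold is illustrative and would need tuning."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return False
    return len(set_a & set_b) / float(len(set_a | set_b)) >= min_overlap

The point is simply that a set intersection is far cheaper than the
dynamic-programming routine, so most non-matching pairs never reach it.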

There are approx. 50,000 addresses in each list.

At the moment the processing assumes all the postcodes are correct, and only compares addresses with matching postcodes. This makes it a lot faster, but it may miss some cases where the postcodes don't match.
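In outline, the postcode grouping is just a dict of lists; the 'postcode' field
access below is illustrative rather than the actual record layout:

def group_by_postcode(addresses):
    """Map postcode -> list of addresses carrying it, so the comparison
    loop only pairs addresses that share a postcode."""
    groups = {}
    for addr in addresses:
        groups.setdefault(addr['postcode'], []).append(addr)
    return groups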

It also does two passes. The first pass looks for exact matches of token sequences, which deals with about half the cases. Only then do I employ the more expensive edit distance technique.
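In outline, the exact-match pass amounts to a dictionary lookup keyed on the
token sequence; the field names below are only illustrative:

def exact_match_pass(list1, list2):
    """First pass: index list1 by token sequence (a tuple, so it can be
    a dict key), then look each entry of list2 up in that index.
    Returns matched pairs plus the leftovers for the second pass."""
    index = dict((tuple(addr['tokens']), addr) for addr in list1)
    matches, leftovers = [], []
    for addr in list2:
        key = tuple(addr['tokens'])
        if key in index:
            matches.append((index[key], addr))
        else:
            leftovers.append(addr)
    return matches, leftovers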

Overall, the program runs in less than half an hour. Specifically, it takes about 60s per thousand addresses, which requires an average of about 8 calls to editDistance per address. Psyco.full() reduced the 60s to 45s.

I'll only try optimisation if I need to use it much more.

Soundex?? To put it bluntly, the _only_ problem to which soundex is the
preferred solution is genealogy searching in the US census records, and
even then one needs to know what varieties of the algorithm were in use
at what times. I thought you said your addresses came from
authoritative sources. You have phonetic errors? Can you give some
examples of pairs of tokens that illustrate the problem you are trying
to overcome with soundex?

In retrospect, Soundex was probably not the best choice. The misspellings tend to be minor, e.g.
Kitwhistle and KITTWHISTLE
Tythe and TITHE


I was tempted by an edit distance technique on the tokens, but would prefer a hash-based method for efficiency reasons.
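For token pairs like those above, a normalised character-level similarity from
the standard library's difflib would be one alternative to Soundex; the 0.8
threshold below is only an example value:

import difflib

def token_similarity(tok1, tok2):
    """Case-insensitive similarity in [0, 1] from difflib's
    SequenceMatcher; 'Kitwhistle' vs 'KITTWHISTLE' scores about 0.95,
    'Tythe' vs 'TITHE' scores 0.8."""
    return difflib.SequenceMatcher(None, tok1.lower(), tok2.lower()).ratio()

# e.g. treat two tokens as equivalent if token_similarity(a, b) >= 0.8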

Back to speed again: When you look carefully at the dynamic programming
algorithm for edit distance, you will note that it is _not_ necessary
to instantiate the whole NxM matrix -- it only ever refers to the
current row and the previous row. What does space saving have to do
with speed, you ask? Well, Python is not FORTRAN; it takes considerable
effort to evaluate d[i][j]. A relatively simple trick is to keep 2 rows
and swap (the pointers to) them each time around the outer loop. At the
expense of a little more complexity, one can reduce this to one row and
3 variables (north, northwest, and west) corresponding to d[i-1][j],
d[i-1][j-1], and d[i][j-1] -- but I'd suggest the simple way first.
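To make that concrete, here is a plain Levenshtein distance written the
two-row way; this is the generic algorithm, not your weighted cost functions:

def edit_distance_two_rows(seq1, seq2):
    """Classic edit distance keeping only the previous and current rows
    of the DP matrix, swapping them each time around the outer loop."""
    prev = list(range(len(seq2) + 1))   # row for the empty prefix of seq1
    for i, item1 in enumerate(seq1):
        curr = [i + 1]                  # cost of deleting the first i+1 items
        for j, item2 in enumerate(seq2):
            cost = int(item1 != item2)
            curr.append(min(prev[j + 1] + 1,   # deletion     (north)
                            curr[j] + 1,       # insertion    (west)
                            prev[j] + cost))   # substitution (northwest)
        prev = curr                     # swap: current row becomes previous
    return prev[-1]

# e.g. edit_distance_two_rows('tythe', 'tithe') == 1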
Hope some of this helps,

Thanks for that. The first edit distance algorithm I looked at did it that way. But I based my code on a version of the algorithm I could actually follow ;-). Getting the concatenation bit right was non-trivial and it was useful to store all of "d" for debugging purposes.


As to Python not being Fortran: you've found me out. The three languages I am most comfortable with are Fortran, Matlab and Python. It did occur to me that numarray might be a more efficient way of dealing with a 4-dimensional array, but the arrays aren't very big, so the overhead in setting them up might be significant.

The simplest optimisation would be to replace the two indices used to deal with concatenation with four explicit variables. And then, as you said, I could store just the last three rows and avoid any multiple indexing.

As with all these potential optimisations, you don't know until you try them.

--
Andrew McLean
--
http://mail.python.org/mailman/listinfo/python-list
