Andrew, > Basically, I have two databases containing lists of postal addresses and > need to look for matching addresses in the two databases. More > precisely, for each address in database A I want to find a single > matching address in database B.
What percent of addresses in A have a unique corresponding address in B? (i.e. how many addresses will have some match in B?) This is a standard document retrieval task. Whole books could be written about the topic. (In fact, many have been). I suggest you don't waste your time trying to solve this problem from scratch, and instead capitalize on the effort of others. Hence, my proposal is pretty simple: 1. Regularize the punctuation of the text (e.g. convert it all to uppercase), since it is uninformative and---at best---a confounding variable. 2. Use a free information retrieval package to find matches. e.g. LEMUR: http://www-2.cs.cmu.edu/~lemur/ In this case, a "document" is an address in Database B. A "query" is an address in Database A. (Alternately, you could switch A and B to see if that affects accuracy.) Good luck. Joseph -- http://mail.python.org/mailman/listinfo/python-list