Andrew McLean <[EMAIL PROTECTED]> wrote:
> Thanks for all the suggestions. There were some really useful pointers.
>
> A few random points:
>
> 1. Spending money is not an option, this is a 'volunteer' project. I'll
> try out some of the ideas over the weekend.
> ...
> I am tempted to try an approach based on splitting the address into a
> sequence of normalised tokens. Then work with a metric based on the
> differences between the sequences. The simple case would look at
> deleting tokens and perhaps concatenating tokens to make a match.
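
For what it's worth, the token-sequence idea quoted above could be sketched roughly as follows. This is only an illustration under my own assumptions, not anything taken from Febrl: the names normalise, token_distance and ABBREVIATIONS are hypothetical, as are the tiny abbreviation table and the cost of 0.5 for joining two adjacent tokens. Dropping a token from either address costs 1, and there is no substitution step, so a replaced token counts as two deletions.

```python
import re

# Hypothetical abbreviation table, for illustration only. A real one would be
# much larger and would have to cope with ambiguity ("st" can also be "saint").
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue", "ln": "lane"}


def normalise(address):
    """Lower-case, drop punctuation, split into tokens, expand abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return [ABBREVIATIONS.get(tok, tok) for tok in tokens]


def token_distance(a, b, join_cost=0.5):
    """Distance between two token lists.

    Allowed steps: drop a token from either list (cost 1), or join two
    adjacent tokens in one list so that they equal a single token in the
    other (cost join_cost, an assumed value). Equal tokens match for free.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the distance between a[:i] and b[:j].
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = float(i)
    for j in range(1, cols):
        d[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            # Drop the last token of a, or the last token of b.
            best = min(d[i - 1][j], d[i][j - 1]) + 1.0
            # The last tokens are identical.
            if a[i - 1] == b[j - 1]:
                best = min(best, d[i - 1][j - 1])
            # The last two tokens of a, joined, equal the last token of b.
            if i >= 2 and a[i - 2] + a[i - 1] == b[j - 1]:
                best = min(best, d[i - 2][j - 1] + join_cost)
            # The last two tokens of b, joined, equal the last token of a.
            if j >= 2 and b[j - 2] + b[j - 1] == a[i - 1]:
                best = min(best, d[i - 1][j - 2] + join_cost)
            d[i][j] = best
    return d[-1][-1]


if __name__ == "__main__":
    first = normalise("Rose Cottage, High St.")
    second = normalise("Rosecottage High Street")
    print(first, second, token_distance(first, second))
```

A small distance suggests a likely match. Where to put the cut-off, and whether to weight tokens differently (a house number probably matters more than "flat"), would have to be tuned on real data.
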

Do please have a look at the Febrl project at http://febrl.sf.net

We would be most interested to learn how well its HMM address parser works for all those "quaint" English addresses, and its Fellegi-Sunter probabilistic matching engine should give good results on your data (or use the simpler deterministic engine if you like). Provided that your data are not too large (say, no more than a few hundred thousand records), Febrl should work fairly well.

We'd be pleased to get any feedback you may have.

Tim C