Andrew McLean <[EMAIL PROTECTED]> wrote:
> 
> Thanks for all the suggestions. There were some really useful pointers.
> 
> A few random points:
> 
> 1. Spending money is not an option, this is a 'volunteer' project. I'll 
> try out some of the ideas over the weekend.
> ...
> I am tempted to try an approach based on splitting the address into a 
> sequence of normalised tokens. Then work with a metric based on the 
> differences between the sequences. The simple case would look at 
> deleting tokens and perhaps concatenating tokens to make a match.
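
For what it's worth, the token-based idea quoted above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the abbreviation table, the function names and the two cost parameters are all hypothetical, and a real table of English address abbreviations would need to be far larger.

```python
import re

# Hypothetical abbreviation table (an assumption, for illustration only).
ABBREVIATIONS = {"rd": "road", "ave": "avenue", "ln": "lane"}


def normalise(address):
    """Split an address into lower-case tokens and expand abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return [ABBREVIATIONS.get(t, t) for t in tokens]


def token_distance(a, b, delete_cost=1.0, join_cost=0.5):
    """Dynamic-programming distance between two token sequences.

    Allowed operations: drop a token from either sequence (delete_cost),
    or match two adjacent tokens of one sequence, concatenated, against
    a single token of the other (join_cost). Equal tokens match for free.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # delete a token from a
                d[i][j] = min(d[i][j], d[i - 1][j] + delete_cost)
            if j > 0:  # delete a token from b
                d[i][j] = min(d[i][j], d[i][j - 1] + delete_cost)
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:  # exact match
                d[i][j] = min(d[i][j], d[i - 1][j - 1])
            if i > 1 and j > 0 and a[i - 2] + a[i - 1] == b[j - 1]:
                # two tokens of a joined together equal one token of b
                d[i][j] = min(d[i][j], d[i - 2][j - 1] + join_cost)
            if i > 0 and j > 1 and b[j - 2] + b[j - 1] == a[i - 1]:
                # two tokens of b joined together equal one token of a
                d[i][j] = min(d[i][j], d[i - 1][j - 2] + join_cost)
    return d[n][m]
```

Called as token_distance(normalise("12 Green Field Rd."), normalise("12 Greenfield Road")), the only cost incurred is one join of "green" and "field". There is no substitution operation in this sketch, so two tokens that simply differ cost two deletions.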

Do please have a look at the Febrl project at http://febrl.sf.net

We would be most interested to learn how well its HMM address parser works for
all those "quaint" English addresses. Its Fellegi-Sunter probabilistic matching
engine should give good results on your data (or use the simpler deterministic
engine if you like). Provided that your data are not too large (e.g. more than a
few hundred thousand records), Febrl should work fairly well. We'd be pleased
to get any feedback you may have.
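
In case the Fellegi-Sunter approach is unfamiliar, here is a minimal sketch of how a match weight is computed for one pair of records. It is not Febrl's code or API: the field names and the m- and u-probabilities below are made-up assumptions, where m is the probability that a field agrees for a true match and u the probability that it agrees for a non-match.

```python
import math

# Hypothetical (m, u) probabilities per field; real values would be
# estimated from the data rather than guessed like this.
FIELD_PROBS = {
    "postcode": (0.95, 0.01),
    "street": (0.90, 0.05),
    "number": (0.85, 0.10),
}


def match_weight(agreements):
    """Sum the Fellegi-Sunter log-likelihood-ratio weights for a record pair.

    `agreements` maps a field name to True (the two records agree on that
    field) or False (they disagree).
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PROBS[field]
        if agrees:
            total += math.log2(m / u)  # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total
```

Pairs whose total weight is above an upper threshold are classed as matches, those below a lower threshold as non-matches, and those in between are left for clerical review. The thresholds themselves are another thing that has to be chosen for the data at hand.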

Tim C
-- 
http://mail.python.org/mailman/listinfo/python-list
