Thanks for all the suggestions. There were some really useful pointers.

A few random points:

1. Spending money is not an option; this is a 'volunteer' project. I'll try out some of the ideas over the weekend.

2. Someone commented that the data was suspiciously good quality. Both data sources are ones that you might expect to be authoritative. If you take having a correctly formatted and valid postcode as the metric, 100% of the records pass in one database and 99.96% in the other.

3. I've already noticed duplicate addresses in one of the databases.

4. You need to be careful doing an endswith search. It was actually my first approach to the house name issue. The problem is you end up matching "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".
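
A minimal sketch of the pitfall in point 4, and one possible way around it by comparing whole tokens instead of raw characters. The function names and the "Rose Cottage" address are made up for illustration; this is not the code actually used on the project.

```python
def naive_match(full, short):
    """Naive test: does the longer address string end with the shorter one?"""
    return full.lower().endswith(short.lower())


def token_suffix_match(full, short):
    """Compare whole whitespace-separated tokens at the end of the address."""
    full_tokens = full.lower().split()
    short_tokens = short.lower().split()
    if not short_tokens or len(short_tokens) > len(full_tokens):
        return False
    return full_tokens[-len(short_tokens):] == short_tokens


# The character-level test wrongly accepts "12 ..." as ending with "2 ...".
print(naive_match("12 Acacia Avenue", "2 Acacia Avenue"))         # True
# The token-level test rejects it, because "12" != "2" as whole tokens.
print(token_suffix_match("12 Acacia Avenue", "2 Acacia Avenue"))  # False
# A house name in front of the number still matches (hypothetical address).
print(token_suffix_match("Rose Cottage, 2 Acacia Avenue", "2 Acacia Avenue"))  # True
```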

I am tempted to try an approach based on splitting each address into a sequence of normalised tokens, then working with a metric based on the differences between the two sequences. The simple case would look at deleting tokens, and perhaps concatenating adjacent tokens, to make a match.
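
One way the token idea could be sketched, purely as an assumption about how it might work: normalise by lower-casing and keeping only alphanumeric runs, then count how many tokens must be deleted from either sequence to make them equal, allowing two adjacent tokens to be joined for free. The names `tokenise` and `token_distance`, the cost values, and the sample addresses are all hypothetical.

```python
import re


def tokenise(address):
    """Lower-case the address and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", address.lower())


def token_distance(a, b):
    """Number of token deletions needed to make sequences a and b equal.

    Two adjacent tokens in one sequence may be concatenated to match a
    single token in the other at no cost (an assumed cost, for illustration).
    """
    n, m = len(a), len(b)
    # d[i][j] is the cost of reconciling a[:i] with b[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j  # delete everything that is left over
                continue
            # Delete the last token of a, or the last token of b.
            best = min(d[i - 1][j], d[i][j - 1]) + 1
            # Last tokens are identical.
            if a[i - 1] == b[j - 1]:
                best = min(best, d[i - 1][j - 1])
            # Last two tokens of a, joined, equal the last token of b.
            if i >= 2 and a[i - 2] + a[i - 1] == b[j - 1]:
                best = min(best, d[i - 2][j - 1])
            # Last two tokens of b, joined, equal the last token of a.
            if j >= 2 and b[j - 2] + b[j - 1] == a[i - 1]:
                best = min(best, d[i - 1][j - 2])
            d[i][j] = best
    return d[n][m]


print(token_distance(tokenise("Rosecottage, High St"),
                     tokenise("Rose Cottage High St")))          # 0
print(token_distance(tokenise("Rose Cottage, 2 Acacia Avenue"),
                     tokenise("2 Acacia Avenue")))                # 2
print(token_distance(tokenise("12 Acacia Avenue"),
                     tokenise("2 Acacia Avenue")))                # 2
```

Note that this unweighted sketch gives the same score to a dropped house name and to a mismatched house number, so a practical metric would probably need different costs for different kinds of token.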

--
Andrew McLean
