Ermmm ... only remove "the" when you are sure it is a whole word. Even then it's a dodgy idea. In the first 1000 lines of the nearest address file I had to hand, I found these: Catherine, Matthew, Rotherwood and Weatherall (all with "the" buried inside a name), and "The Avenue" (where "the" is a whole word but also part of the street name).
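To make the whole-word point concrete, here is a minimal sketch using a regex word boundary. The helper name strip_the is hypothetical and not taken from the OP's code; it only illustrates the idea.

```python
import re

# Hypothetical helper (not from the OP's code): remove "the" only when it
# stands alone as a word, ignoring case. \b matches a word boundary, so the
# "the" inside "Catherine" or "Weatherall" is left untouched.
_THE = re.compile(r"\bthe\b", re.IGNORECASE)

def strip_the(text):
    without = _THE.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitution.
    return " ".join(without.split())
```

Note that this still turns "12 The Avenue" into "12 Avenue", which is exactly why stripping "the" remains a dodgy idea even when done on whole words only.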
Ermmm... don't rip out commas (or other punctuation); replace them with spaces. That way "SHORTMOOR,BEAMINSTER" doesn't become the single word "SHORTMOORBEAMINSTER".

A not-unreasonable similarity metric would be float(len(sa1 & sa2)) / len(sa1 | sa2), i.e. the size of the intersection of the two word sets divided by the size of their union. Even more reasonable would be to use trigrams instead of words: they are more robust in the presence of erroneous insertion or deletion of spaces (e.g. "Short Moor" and "Bea Minster" are plausible variations), spelling errors and typos.

BTW, the OP's samples look astonishingly clean to me, quite unlike real-world data.

Two general comments addressed to the OP:

(1) Your solution doesn't handle the case where the postal code has been butchered, e.g. "DT8 BEL" or "OT8 3EL".

(2) I endorse John Roth's comments. Validation against an address database that is provided by the postal authority, either through an outsourced bureau service or by buying a licence to use validation/standardisation/repair software, is IMHO the way to go. In Australia the postal authority assigns a unique ID to each delivery point. This "DPID" has to be barcoded onto the mail article to get bulk postage discounts. Storing the DPID in your database makes duplicate detection a snap. You can license software (from several vendors) that is certified by the postal authority and has batch and/or online APIs. I believe the situation in the UK is similar; at least one of the vendors in Australia is a British company.
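Going back to the similarity metric: the following is a rough sketch of the word-set and trigram variants, not a tuned implementation. The function names (normalise, words, trigrams, jaccard) are made up for illustration, and dropping spaces before cutting trigrams is my assumption about one way to get robustness to stray spaces. In Python 3 the float() call is no longer needed because / is true division.

```python
import re

def normalise(text):
    # Replace each run of punctuation/whitespace with ONE space instead of
    # deleting it, so "SHORTMOOR,BEAMINSTER" stays two words.
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip().upper()

def words(text):
    # Set of whole words in the normalised address.
    return set(normalise(text).split())

def trigrams(text):
    # Assumption: spaces are dropped first, so "SHORT MOOR" and "SHORTMOOR"
    # produce the same trigrams.
    s = normalise(text).replace(" ", "")
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(sa1, sa2):
    # len(intersection) / len(union); 0.0 when both sets are empty.
    union = sa1 | sa2
    if not union:
        return 0.0
    return len(sa1 & sa2) / len(union)
```

For example, jaccard(words("Short Moor, Beaminster"), words("SHORTMOOR,BEAMINSTER")) gives 0.25 because only BEAMINSTER is shared, whereas the same pair compared with trigrams() scores 1.0.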
Google "address deduplication site:.uk".

Actually (3): If you are constrained by budget, pointy-haired boss or hubris to write your own software:
(a) lots of luck
(b) you need to do a bit more research -- look at the links on the febrl website; Google for "Monge Elkan", read their initial paper, and look at the papers referencing it on citeseer; also Google for "merge purge" and for "record linkage" (what the statistical and medical fraternity call the problem)
(c) have a damn good look at your data [like I said, it looks too clean to be true]
(d) when you add a nice new wrinkle like "strip out 'the'", do make sure to run your regression tests :-)

Would you believe (4): you are talking about cross-matching two databases -- don't forget the possibility of duplicates _within_ each database.

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list