On Thu, Feb 24, 2005 at 06:01:50PM -0000, Dermot Paikkos wrote:
> Hi,
> 
> I have a list of about 650 names (a small sample is below) that I 
> need to import into a database. When you look at the list there are 
> some obvious duplicates that are spelt slightly differently. I can 
> rationalize some of the data with some simple substitutions but some 
> of the data looks almost impossible to parse programmatically.
> Here what I have done so far - it's not much:
> 

I would use String::Approx's amatch and run the list in several rounds. The
first round would look for possible 1-step mismatches, the next for 2-step,
then 4-step, then 6-step, and so on. Every time you interactively confirm a
delete, the name is removed from some kind of global hash, so the next round
will not find the same duplicate a second time. Or, if it is a one-time job,
just write a simple script that runs through the list amatching at a fixed
number of steps and deletes everything you confirm, writing the result to a
file. Then increase the step and do it again and again until you get tired
of it :)
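
A rough sketch of the multi-round idea, in case it helps. To keep it
dependency-free it uses a small hand-written Levenshtein distance as a
stand-in for String::Approx (so it is not amatch itself, and amatch may rank
things differently). The names edit_distance and dedupe_rounds, the step list
and the sample names are all made up for illustration. The confirm callback
is where an interactive y/n prompt would go.

```perl
use strict;
use warnings;

# Plain Levenshtein edit distance. This is a stand-in for String::Approx,
# not the algorithm that module uses.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # substitute or match
            $best = $prev[$j] + 1     if $prev[$j] + 1 < $best;     # delete
            $best = $cur[$j - 1] + 1  if $cur[$j - 1] + 1 < $best;  # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Run one pass per step size. A confirmed duplicate is deleted from %alive,
# so later (looser) rounds never offer it again.
sub dedupe_rounds {
    my ($names, $steps, $confirm) = @_;
    my %alive = map { $_ => 1 } @$names;
    for my $max (@$steps) {
        my @pool = sort keys %alive;
        for my $i (0 .. $#pool) {
            my $keep = $pool[$i];
            next unless $alive{$keep};
            for my $j ($i + 1 .. $#pool) {
                my $dup = $pool[$j];
                next unless $alive{$dup};
                next if edit_distance(lc $keep, lc $dup) > $max;
                delete $alive{$dup} if $confirm->($keep, $dup, $max);
            }
        }
    }
    my @left = sort keys %alive;
    return @left;
}

# Hypothetical sample data; the callback here confirms every suggestion.
my @names = ('Jon Smith', 'John Smith', 'John Smyth',
             'Mary Jones', 'Marie Jones', 'Alan Turing');
my @kept = dedupe_rounds(\@names, [1, 2, 4], sub { 1 });
print "$_\n" for @kept;
```

With the auto-confirming callback above, the sample list collapses to Alan
Turing, John Smith and Marie Jones. In real use you would replace sub { 1 }
with something that prints both names and reads an answer from STDIN, because
the looser rounds will start suggesting pairs that are not duplicates at all.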


