On Thu, Feb 24, 2005 at 06:01:50PM -0000, Dermot Paikkos wrote:
> Hi,
>
> I have a list of about 650 names (a small sample is below) that I
> need to import into a database. When you look at the list there are
> some obvious duplicates that are spelt slightly differently. I can
> rationalize some of the data with some simple substitutions but some
> of the data looks almost impossible to parse programmatically.
> Here what I have done so far - it's not much:
I would use String::Approx's amatch and run the list in several rounds. The
first round would look for possible 1-step mismatches, then 2 steps, then 4,
then 6, and so on. Every time you interactively confirm a delete, remove that
name from some kind of global hash, so the next round will not find the same
duplicate a second time.

Or, if it is a one-time job, just write a simple script that runs through the
list amatching with a fixed number of steps, deletes everything you confirm,
and writes the result to a file. Then increase the step count and do it again
and again until you get tired of it :)
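Something along these lines might do as a starting point. This is an untested
sketch, and the file names (names.txt, names_deduped.txt) and the prompt-and-
confirm loop are just assumptions about how you want to work through the list:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use String::Approx 'amatch';

    # Read the names, one per line, into a hash -- this is the "global
    # hash": once a name is deleted here, later rounds never see it again.
    open my $in, '<', 'names.txt' or die "names.txt: $!";
    chomp(my @names = <$in>);
    close $in;
    my %keep = map { $_ => 1 } grep { length } @names;

    # Several rounds, each allowing a larger number of edits (steps).
    for my $steps (1, 2, 4, 6) {
        for my $name (sort keys %keep) {
            next unless $keep{$name};    # may already be gone this round
            my @close = grep { $_ ne $name && $keep{$_} }
                        amatch($name, [ 'i', $steps ], keys %keep);
            for my $dup (@close) {
                print "'$dup' looks like '$name' (within $steps edits); delete? [y/N] ";
                chomp(my $answer = <STDIN>);
                delete $keep{$dup} if lc($answer) eq 'y';
            }
        }
    }

    # Write whatever survived to a file.
    open my $out, '>', 'names_deduped.txt' or die "names_deduped.txt: $!";
    print {$out} "$_\n" for sort keys %keep;
    close $out;

Doing the tight rounds first means the obvious duplicates get confirmed and
removed before the looser rounds run, which should keep the number of false
hits you have to wade through fairly small.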