----- Original Message -----
From: Graeme St. Clair
To: beginners@perl.org
Sent: Thursday, February 24, 2005 2:24 PM
Subject: RE: standardising spellings

-----Original Message-----
From: Peter Rabbitson [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 24, 2005 2:01 PM
To: beginners@perl.org
Subject: Re: standardising spellings

On Thu, Feb 24, 2005 at 06:01:50PM -0000, Dermot Paikkos wrote:
> Hi,
>
> I have a list of about 650 names (a small sample is below) that I need
> to import into a database. When you look at the list there are some
> obvious duplicates that are spelt slightly differently. I can
> rationalize some of the data with some simple substitutions but some
> of the data looks almost impossible to parse programmatically.
> Here is what I have done so far - it's not much:
>

I would use String::Approx's amatch, and run the list in several rounds. The first round would look for possible 1-step mismatches, then 2-step, then 4-step, then 6-step, etc. Every time you interactively confirm a delete, the entry is deleted from some kind of global hash, so the next round will not find the duplicate a second time.

Or, if it is a one-time show, just write a simple thing that will run through the list amatching a fixed number of steps, and delete everything you confirm, writing the result to a file. Then increase the step and do it again and again until you get tired of it :)

#####

Neat! Being one, I particularly enjoyed the "perldoc String::Approx" bit where it compared McScot to MacScot.

I saw this kind of thing attempted in an old IBM VM CMS user-written utility called SCANCMS. I don't have the code any more, but it did things like collapse "nn" into "n", turn "sky" into "ski", drop all h's and such like. So Coffman, Kaufmann and Kauffman would all end up as Cofman or even Cfmn for the purposes of the comparison.

I don't know if this approach would be any improvement on Approx; probably not, tho it does look like Approx is a binary black box, and this approach might be more modifiable.

Rgds, GStC.

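The collapsing rules described above can be sketched roughly as follows. This is a minimal illustration in Python, not the SCANCMS code (which is not available); the function names `collapse_key` and `group_by_key` are hypothetical, and the "k" to "c" rule and the vowel stripping are assumptions added so that Coffman, Kaufmann and Kauffman reduce to the same key.

```python
import re
from collections import defaultdict


def collapse_key(name: str) -> str:
    """Reduce a surname to a crude comparison key (rules are illustrative)."""
    key = re.sub(r"[^a-z]", "", name.lower())  # keep letters only
    key = key.replace("h", "")                 # drop all h's
    key = re.sub(r"sky$", "ski", key)          # turn a trailing "sky" into "ski"
    key = key.replace("k", "c")                # assumption: treat k and c alike
    key = re.sub(r"(.)\1+", r"\1", key)        # collapse doubled letters ("nn" -> "n")
    # Assumption: keep the first letter, then strip the remaining vowels.
    return key[:1] + re.sub(r"[aeiou]", "", key[1:])


def group_by_key(names):
    """Return only the keys shared by two or more names (candidate duplicates)."""
    groups = defaultdict(list)
    for name in names:
        groups[collapse_key(name)].append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}


if __name__ == "__main__":
    print(group_by_key(["Coffman", "Kaufmann", "Smith", "Kauffman"]))
```

Because each rule is a single line, the list is easy to adjust to the data at hand. Groups that share a key are only candidates for merging and would still need a human to confirm them.
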
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>

Adding to Graeme's suggestions regarding a user-written algorithm, I sent this to Dermot, thinking no one would be interested:

FYI, I used to do a lot of work for a direct-mail fulfillment house. We endeavored to take lists of addresses, supplied on reel-to-reel tapes with different record formats (sometimes hundreds of thousands of records), and attempt to identify duplicates, to avoid duplicate mailings and the unnecessary postage and piece cost, etc.

The algorithm for the match code generation is subject to the nature of the lists and record format(s), but typically we would do things like 1) phonetically spell the name (i.e., strip out the vowels); 2) perhaps drop any middle initial; 3) concatenate perhaps three digits of any numerical street address with some component of the zip code (i.e., your postal code; I notice that you're in the UK); and finally 4) string all of this together in a single "match code" that was saved in a standard name-and-address record format.

Once all tapes (files) from different sources were read and the addresses added to the standard file, including duplicates, each with its generated match code, a purge process was run against the file, sorted by match code, so that any record with a duplicate match code could be deleted.

This is an over-simplified scenario (regarding the match-code generation, think about prefixes like Mr., Mrs., Ms., Hon., Honorable, Pres., Dr., etc.; suffixes such as I, II, Esq., etc.; or company names), but I think you get the idea.

BTW, the code was written in RPGII and ran on an IBM S/38 (predecessor to the AS/400). I later wrote a similar system in C that ran on SVR3 AT&T Unix. I might add that having functions like printf(), sprintf(), and regular expressions would have made the old RPGII code a lot easier to manipulate.

OTTF, Ron W.
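
The match-code and purge steps described above might look something like the sketch below. It is written in Python purely for illustration and is not the RPGII or C code mentioned; the names `match_code`, `purge` and `TITLES`, the (name, street, postcode) record layout, the short title list, and the choice of the first three postcode characters are all assumptions.

```python
import re
from itertools import groupby

# Illustrative only; a real list would need many more prefixes and suffixes.
TITLES = {"mr", "mrs", "ms", "dr", "hon", "honorable", "pres", "esq"}


def match_code(name: str, street: str, postcode: str) -> str:
    """Build a match code from a name, a street address and a postal code."""
    words = [w for w in re.findall(r"[a-z]+", name.lower()) if w not in TITLES]
    # Drop single-letter middle initials, keeping the first and last words.
    last = len(words) - 1
    words = [w for i, w in enumerate(words) if len(w) > 1 or i in (0, last)]
    # "Phonetic" spelling: keep each word's first letter, strip other vowels.
    name_part = "".join(w[0] + re.sub(r"[aeiou]", "", w[1:]) for w in words)
    number = re.search(r"\d+", street)
    street_part = number.group()[:3] if number else ""  # up to three digits
    post_part = re.sub(r"[^A-Za-z0-9]", "", postcode)[:3]  # assumed component
    return (name_part + street_part + post_part).upper()


def purge(records):
    """Sort records by match code and keep the first record for each code."""
    coded = sorted(((match_code(*r), r) for r in records), key=lambda p: p[0])
    return [next(group)[1] for _, group in groupby(coded, key=lambda p: p[0])]


if __name__ == "__main__":
    records = [
        ("Mr. John Q. Smith", "1234 High Street", "SW1A 1AA"),
        ("John Smith", "1234 High St", "SW1A 1AA"),
        ("Jane Doe", "7 Mill Lane", "EH1 2NG"),
    ]
    for rec in purge(records):
        print(match_code(*rec), rec)
```

Sorting first and then keeping one record per code mirrors the sorted-file purge described above. How aggressive the code is (how many digits, which part of the postal code) is a tuning decision that depends on the lists being merged.
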
------------------------------------------------------------------------------