Re: standardising spellings

Ron Wingfield Thu, 24 Feb 2005 13:10:18 -0800

  ----- Original Message ----- 
  From: Graeme St. Clair 
  To: beginners@perl.org 
  Sent: Thursday, February 24, 2005 2:24 PM
  Subject: RE: standardising spellings

  -----Original Message-----
  From: Peter Rabbitson [mailto:[EMAIL PROTECTED] 
  Sent: Thursday, February 24, 2005 2:01 PM
  To: beginners@perl.org
  Subject: Re: standardising spellings

  On Thu, Feb 24, 2005 at 06:01:50PM -0000, Dermot Paikkos wrote:
  > Hi,
  > 
  > I have a list of about 650 names (a small sample is below) that I need 
  > to import into a database. When you look at the list there are some 
  > obvious duplicates that are spelt slightly differently. I can 
  > rationalize some of the data with some simple substitutions but some 
  > of the data looks almost impossible to parse programmatically.
  > Here's what I have done so far - it's not much:
  > 

  I would use String::Approx's amatch and run the list in several rounds. The
  first round would look for possible 1-step mismatches, then 2-step, then
  4-step, then 6-step, etc. Every time you interactively confirm a delete, the
  entry is deleted from some kind of global hash, so the next round will not
  find the duplicate a second time. Or, if it is a one-time job, just write a
  simple script that runs through the list amatching a fixed number of steps
  and deletes everything you confirm, writing the result to a file. Then
  increase the step and do it again and again until you get tired of it :)
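
  A rough sketch of the multi-round idea, written in Python rather than Perl.
  It does not call String::Approx: a plain whole-string Levenshtein distance
  stands in for amatch, and the names edit_distance, dedupe_in_rounds and
  confirm are made up for illustration. The confirm callback is where the
  interactive yes/no prompt would go.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def dedupe_in_rounds(names, confirm, steps=(1, 2, 4, 6)):
    """One pass per threshold; drop each name the caller confirms as a duplicate."""
    kept = dict.fromkeys(names)  # insertion-ordered set of surviving names
    for limit in steps:
        survivors = list(kept)
        for i, first in enumerate(survivors):
            if first not in kept:
                continue
            for second in survivors[i + 1:]:
                if second not in kept:
                    continue
                close = edit_distance(first.lower(), second.lower()) <= limit
                if close and confirm(first, second):
                    del kept[second]  # gone, so later rounds never see it
    return list(kept)
```

  For an interactive run, confirm could wrap input() and return True on "y".
  Comparing every pair is quadratic, which should be tolerable for a list of
  about 650 names.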

  #####

  Neat!  Being one, I particularly enjoyed the "perldoc String::Approx" bit
  where it compared McScot to MacScot.

  I saw this kind of thing attempted in an old IBM VM CMS user-written utility
  called SCANCMS.  I don't have the code any more, but it did things like
  collapse "nn" into "n", turn "sky" into "ski", drop all h's and suchlike.
  So Coffman, Kaufmann and Kauffman would all end up as Cofman or even Cfmn
  for the purposes of the comparison.
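
  A minimal Python sketch of that kind of normalisation. This is not the
  SCANCMS code, which I no longer have. Only the doubled-letter, "sky" and
  h rules come from the description above; the k-to-c and "au"-to-"o" rules
  are my own assumptions, added so that the Coffman example works, and the
  function name name_key is hypothetical.

```python
import re


def name_key(name, drop_vowels=False):
    """Reduce a surname to a crude comparison key."""
    key = name.lower()
    key = key.replace("h", "")           # drop all h's
    key = key.replace("sky", "ski")      # turn "sky" into "ski"
    key = key.replace("k", "c")          # assumption: treat k as c
    key = key.replace("au", "o")         # assumption: treat "au" as "o"
    key = re.sub(r"(.)\1+", r"\1", key)  # collapse doubled letters, e.g. "nn" -> "n"
    if drop_vowels:
        key = re.sub(r"[aeiou]", "", key)  # the more aggressive "Cfmn" form
    return key
```

  The keys are only meant for comparing names with each other, so it does
  not matter that "Kowalski" ends up as the odd-looking "cowalsci".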

  I don't know if this approach would be any improvement on Approx; probably
  not, tho it does look like Approx is a binary black box, and this approach
  might be more modifiable.

  Rgds, GStC.

  Adding to Graeme's suggestions regarding a user-written algorithm, I sent 
this to Dermot, thinking no one would be interested:

    FYI, I used to do a lot of work for a direct-mail fulfillment house.  We
took lists of addresses, supplied on reel-to-reel tapes with different record
formats (sometimes hundreds of thousands of records), and tried to identify
duplicates so as to avoid duplicate mailings and the unnecessary postage and
piece cost.  The algorithm for generating the match code depends on the
nature of the lists and record format(s), but typically we would do things
like 1) phonetically spell the name (i.e., strip out the vowels); 2) perhaps
drop any middle initial; 3) concatenate perhaps three digits of any numeric
street address with some component of the zip code (i.e., your postal code;
I notice that you're in the UK); and finally 4) string all of this together
into a single "match code" that was saved in a standard name-and-address
record format.
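
    To make the four steps concrete, here is a small Python sketch. It is not
the original RPGII or C code; the function name match_code, the choice of the
first three characters of the postal code, and the order in which the pieces
are joined are all assumptions made for illustration.

```python
import re


def match_code(name, street, postcode):
    """Build a crude match code from a name, street line and postal code."""
    words = name.upper().split()
    # 2) drop a single-letter middle initial such as "Q" or "Q."
    words = [w for i, w in enumerate(words)
             if not (0 < i < len(words) - 1 and re.fullmatch(r"[A-Z]\.?", w))]
    # 1) "phonetic" spelling: keep letters only, then strip out the vowels
    letters = re.sub(r"[^A-Z]", "", "".join(words))
    consonants = re.sub(r"[AEIOU]", "", letters)
    # 3) up to three leading digits of the street number ...
    found = re.match(r"\s*(\d{1,3})", street)
    house = found.group(1) if found else ""
    # ... plus the first three characters of the postal code (assumption)
    zone = re.sub(r"[^A-Z0-9]", "", postcode.upper())[:3]
    # 4) string it all together
    return consonants + house + zone
```

    With these choices, "John Q. Public, 1234 Main St, 75001" and "JOHN
PUBLIC, 1234 Main Street, 75001-9999" produce the same code, which is the
whole point of the exercise.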

    Once all tapes (files) from different sources were read and addresses added 
to the standard file, including duplicates with the generated match code, then 
a purge process was run against the file, sorted by match code so that any 
record with a duplicate match code could be deleted.
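
    The purge step amounts to a sort followed by a single pass over the
sorted file. A minimal Python sketch, assuming each record is a
(match_code, data) pair already held in memory; the name purge_duplicates is
hypothetical, and keeping the first record of each group is just one possible
policy:

```python
from itertools import groupby
from operator import itemgetter


def purge_duplicates(records):
    """Sort by match code; keep the first record of each group, drop the rest."""
    ordered = sorted(records, key=itemgetter(0))  # stable: ties keep input order
    kept, dropped = [], []
    for _, group in groupby(ordered, key=itemgetter(0)):
        first, *rest = group
        kept.append(first)
        dropped.extend(rest)
    return kept, dropped
```

    Returning the dropped records as well makes it possible to review what
was purged before anything is thrown away for good.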

    This is an oversimplified scenario (regarding the match-code generation,
think about prefixes like Mr., Mrs., Ms., Hon., Honorable, Pres., Dr., etc.;
suffixes such as I, II, Esq., etc.; or company names), but I think you get the
idea.
  BTW, the code was written in RPGII and ran on an IBM S/38 (the predecessor
to the AS/400).  I later wrote a similar system in C that ran on SVR3 AT&T
Unix.  I might add that having functions like printf() and sprintf(), and
regular expressions, would have made the old RPGII code a lot easier to
manipulate.

  OTTF,
  Ron W. 