Paul Kraus wrote:
> I have to find duplicate customers in our customer file (around 60,000
> customers). The file has been exported into a pipe-delimited file:
>
> CustCode|Ship2Code|Name|Addr1|Addr2|City|State|ZipCode|Phone|Fax|Country
>
> Normally this task is done by printing it and someone going through it
> manually to find them.
Duh?

> The problem is the duplicates can be misspelled, meaning you can't
> just do an exact search.
>
> My thinking was a couple of passes: Phone Numbers, Addresses, then
> address digits & City.
>
> The first two will give me pretty secure matches and the third will
> give some possibilities.
>
> I would like the script to process the file and then dump out the
> lines to another file. I can't figure out how to lay out the script
> or what data structures to use. I guess I would almost have to set
> it up like an old school bubble sort routine.
>
> Any suggestions or ideas would be greatly appreciated.

How about this: read through the file one line at a time, keeping a
hash of each field you want to check for matches. Each time you come
across a record you think you haven't seen before, print it to a new
file. There's no need for anything like a bubble sort: a hash lookup
tells you in a single step whether you've already seen a value.

Suppose you were checking just the zip code (which seems like a good
idea to me but wasn't in your list):

    my %zips_found;                   # zip codes we have already seen
    while (<>) {
        my @rec = split /\|/;         # the pipe must be escaped in the pattern
        my $zip = $rec[7];            # ZipCode is the eighth field
        next if $zips_found{$zip}++;  # drop any record whose zip we've seen
        print;                        # first record with this zip: keep it
    }

You can add hashes for other fields and vary the behaviour until you
think it's discarding the right records (there are some sketches in
the P.S. below). You may also want to dump the discarded records into
a different file so you can put them back!

You're going to need to validate the values before you check for
duplicates. For instance, you don't want to discard all but one of
the records with a blank zip code.

HTH,

Rob
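P.S. Here are a few untested sketches, in case they help with the
layout. The field positions come from the header line in your post;
the normalisation (comparing phone numbers digits-only, and addresses
lowercased with the whitespace squeezed out) is just my guess at what
you'll want. First, hashes for the phone and address passes:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen_phone;    # phone numbers seen so far, digits only
    my %seen_addr;     # address+city keys seen so far

    while (my $line = <>) {
        my @rec = split /\|/, $line;
        my ($addr1, $city, $phone) =
            map { defined $_ ? $_ : '' } @rec[3, 5, 8];

        # Pass 1: phone numbers, compared digits-only so that
        # "555-1234" and "5551234" count as the same number
        (my $phone_key = $phone) =~ s/\D//g;

        # Pass 2: addresses, compared case- and space-insensitively
        my $addr_key = lc "$addr1|$city";
        $addr_key =~ s/\s+//g;

        # A record is a suspected duplicate if either key has been
        # seen before -- but a blank key never counts as a match
        my $dup = 0;
        $dup = 1 if $phone_key ne '' && $seen_phone{$phone_key}++;
        $dup = 1 if $addr1 =~ /\S/   && $seen_addr{$addr_key}++;

        print $line unless $dup;
    }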
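Second, sending the discards to their own file so you can put them
back -- the file names here are just placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $keep,    '>', 'unique.txt'     or die "unique.txt: $!";
    open my $discard, '>', 'duplicates.txt' or die "duplicates.txt: $!";

    my %seen_phone;
    while (<>) {
        my ($phone) = (split /\|/)[8];
        $phone = '' unless defined $phone;
        $phone =~ s/\D//g;                  # digits only

        if ($phone ne '' && $seen_phone{$phone}++) {
            print $discard $_;    # suspected duplicate: recoverable
        }
        else {
            print $keep $_;       # first sighting: keep it
        }
    }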
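And for the validation, the simplest rule is that a field is only
usable as a duplicate key if it has some real content; otherwise all
the blank-zip records would collapse into one:

    my %zips_found;
    while (<>) {
        my ($zip) = (split /\|/)[7];
        if (usable($zip)) {
            next if $zips_found{$zip}++;  # seen this zip: drop the record
        }
        print;    # blank or junk zip: keep it, we can't tell
    }

    # A value can be checked for duplicates only if it's defined
    # and contains at least one non-whitespace character
    sub usable {
        my $value = shift;
        return defined $value && $value =~ /\S/;
    }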