Paul Kraus wrote:
> I have to find duplicate customers in our customer file (around 60,000
> customers).
> The file has been exported into a pipe delimited file.
>
>
CustCode|Ship2Code|Name|Addr1|Addr2|City|State|ZipCode|Phone|Fax|Country
>
> Normally this task is done by printing it and someone going through it
> manually to find them.

Duh?

> The problem is that the duplicates can be misspelled, meaning you can't
> just do an exact search.
> My thinking was a couple of passes: phone numbers, addresses, then
> address digits & city.
>
> The first two will give me pretty secure matches and the third will
> give some possibilities.
>
> I would like the script to process the file and then dump out the
> lines to another file.
> I can't figure out how to lay out the script or what data structures to
> use. I guess I would almost have to set it up like an old-school
> bubble sort routine.
>
> Any suggestions or ideas would be greatly appreciated.

How about this:

Read through the file one line at a time, keeping a hash for
each field you want to check for matches. Each time you
come across a record you think you haven't seen before,
print it to a new file. Suppose you were checking just the zip
code (which seems like a good idea to me but wasn't in your
list):

    my %zips_found;

    while (<>) {
        my @rec = split /\|/;           # the | must be escaped: it's a regex metacharacter
        my $zip = $rec[7];              # ZipCode is the eighth field
        next if $zips_found{$zip}++;    # skip any record whose zip we've seen before
        print;
    }
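(Run it as something like "perl dedupe.pl customers.txt > unique.txt" --
the filenames are just placeholders; the surviving records go to
standard output.)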

You can add hashes for other fields and vary the behaviour
until you think it's discarding the right records. You may
also want to dump the discarded records into a different
file so you can put them back!
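Something like this, for instance -- a rough sketch, assuming the
discards should go to a file called dups.txt (the name is just an
example) and that a match on either zip or phone counts as a duplicate:

    my (%zips_found, %phones_found);

    open my $dups, '>', 'dups.txt'
        or die "Can't open dups.txt: $!";

    while (<>) {
        my @rec = split /\|/;
        my ($zip, $phone) = @rec[7, 8];    # ZipCode and Phone fields

        # A record is a duplicate if we've already seen its zip
        # OR its phone number
        if ($zips_found{$zip}++ or $phones_found{$phone}++) {
            print $dups $_;    # keep the discards so you can put them back
            next;
        }
        print;
    }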

You're going to need to validate the values before you
check for duplicates. For instance, you don't want to
discard all but one of the records with a blank zip code.
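For example (again just a sketch -- adjust the pattern to whatever
counts as a valid zip in your data; this one assumes US-style zips):

    my %zips_found;

    while (<>) {
        my @rec = split /\|/;
        my $zip = $rec[7];

        # Only use the zip as a duplicate key if it looks like a real one;
        # records with blank or malformed zips always pass through.
        if (defined $zip and $zip =~ /^\d{5}(?:-\d{4})?$/) {
            next if $zips_found{$zip}++;
        }
        print;
    }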

HTH,

Rob



