Hi,
   the data eventually has to be loaded into a database.
Before loading it, we need to do some validation, such as
removing duplicates. That is why I am doing this.

I have another idea: first sort the file. Once the file is
sorted, duplicate keys end up on adjacent lines, so finding
them is easy:

# FH, good_file and dup_file are assumed to be opened already
my $last_key = '';              # keys are fixed width, so '' never matches
my ($dup, $good) = (0, 0);

while (<FH>) {
    my $cur_key = substr($_, 10, 10);
    if ($cur_key eq $last_key) {    # string compare (eq), not numeric (==)
        print dup_file $_;          # same key as the line before: duplicate
        $dup++;
    } else {
        print good_file $_;         # first line with this key: keep it
        $good++;
    }
    $last_key = $cur_key;
}

Then the only thing left is to find out how much time
it will take to sort a 22-million-row file....
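
On most Unix systems the external sort(1) utility can handle a file
of this size, spilling to temporary files on disk as needed, so that
may be the easiest way to presort. A rough sketch of invoking it from
Perl, assuming GNU sort, fixed-width records with no blanks in the
first 20 characters (so the key in characters 11-20 stays inside the
first sort field), and made-up file names:

# sort on characters 11-20 of each line, i.e. the same key
# that substr($_, 10, 10) extracts; 'input.dat' and
# 'sorted.dat' are example names (GNU sort key syntax assumed)
system('sort', '-k', '1.11,1.20', '-o', 'sorted.dat', 'input.dat') == 0
    or die "sort failed: exit status $?";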


Any comments? Suggestions?




--- "Beau E. Cox" <[EMAIL PROTECTED]> wrote:
> Hi -
> 
> > -----Original Message-----
> > From: Madhu Reddy [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, February 22, 2003 11:12 AM
> > To: [EMAIL PROTECTED]
> > Subject: Out of memory while finding duplicate rows
> > 
> > 
> > Hi,
> >   I have a script that finds duplicate rows
> > in a file. The file has 13 million records,
> > and not more than 5% of them are duplicates.
> > 
> > For finding duplicates I am using the following function:
> > 
> > while (<FH>) {
> >     if (find_duplicates()) {
> >         $dup++;
> >     }
> > }
> > 
> > # returns 1 if the record is a duplicate,
> > # returns 0 if it is not
> > sub find_duplicates {
> >     my $key = substr($_, 10, 10);
> >     if (exists $keys{$key}) {
> >         $keys{$key}++;
> >         return 1;    # duplicate row
> >     } else {
> >         $keys{$key}++;
> >         return 0;    # not a duplicate
> >     }
> > }
> > ---------------------------------------------
> > Here I am storing 13 million keys in a hash...
> > I think that is why I am running out of
> > memory.
> > 
> > How can I avoid this?
> > 
> > Thanx
> > -Madhu
> > 
> 
> Yeah, Madhu, you are treading on the edge of memory
> capabilities...
> 
> You may need to use a database (MySQL comes to mind)
> and write a key-value table that could do the job of
> the hash; then you could handle as many records as
> your disk space allows.
> 
> Do you currently have a database installed? If you
> are running on Windows, even Access would work. Have
> you used the Perl DBI (CPAN) interface?
> 
> Just some thoughts...
> 
> Aloha => Beau;
> 
> 
> 
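
If the sort approach turns out to be too slow, the key-value table you
suggest might look something like the sketch below with DBI. This is
only a rough sketch: it uses DBD::SQLite instead of MySQL so that it
is self-contained, and the database and file names are made up.

use strict;
use warnings;
use DBI;

# keys.db and input.dat are example names
my $dbh = DBI->connect('dbi:SQLite:dbname=keys.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE seen (k TEXT PRIMARY KEY)');
my $ins = $dbh->prepare('INSERT OR IGNORE INTO seen (k) VALUES (?)');

my $dup = 0;
open FH, '<', 'input.dat' or die "open: $!";
while (<FH>) {
    my $key = substr($_, 10, 10);
    # execute() reports 0 rows affected when the key is already
    # in the table, i.e. the current row is a duplicate
    $dup++ if $ins->execute($key) == 0;
}
$dbh->commit;
print "$dup duplicate rows\n";

The seen keys live on disk instead of in memory, so this should
sidestep the out-of-memory problem at the cost of some speed.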


