Hi - Wait! If you are going to load the data into a database anyway, why not use the existing database (or the one being created) to remove the duplicates? You don't even need an index on the column you are making unique (though it would be _much_ faster with one). Just select on your key and, if the key is found, reject the datum as a duplicate. You really shouldn't have to go to any draconian measures to find duplicates!
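
Something along these lines would do it. Untested, and the DSN, table name, and key column are only placeholders, so adjust them to whatever you are actually loading into:

use strict;
use warnings;
use DBI;

# Placeholder DSN, table, and column names; substitute your real schema.
my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'password',
                       { RaiseError => 1 });

my $check  = $dbh->prepare('SELECT 1 FROM records WHERE rec_key = ?');
my $insert = $dbh->prepare('INSERT INTO records (rec_key, line) VALUES (?, ?)');

open my $fh, '<', 'records.dat' or die "records.dat: $!";
my $dup = 0;
while (my $line = <$fh>) {
    my $key = substr($line, 10, 10);   # same 10-character key as in your script
    $check->execute($key);
    if ($check->fetchrow_array) {      # key already in the table => duplicate
        $dup++;
    } else {
        $insert->execute($key, $line);
    }
    $check->finish;
}
print "rejected $dup duplicates\n";

And if you put a UNIQUE index on the key column, you could skip the SELECT entirely and just trap the failed INSERTs.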
Aloha => Beau;

> -----Original Message-----
> From: Madhu Reddy [mailto:[EMAIL PROTECTED]
> Sent: Saturday, February 22, 2003 8:26 PM
> To: Beau E. Cox; [EMAIL PROTECTED]
> Subject: RE: Out of memory while finding duplicate rows
>
> Hi,
>    That data finally has to be loaded into the database.
> Before loading it, we need to do some validations, like
> removing duplicates... that is why I am doing this.
>
> I have another idea: first sort the file. Once the file
> is sorted, it is easy to find duplicates:
>
> my ($last_key, $first) = ('', 1);
> while (<FH>) {
>     my $cur_key = substr($_, 10, 10);
>     if (!$first && $cur_key eq $last_key) {   # 'eq', not '==': the keys are strings
>         print DUP_FILE $_;                    # same key as the previous (sorted) line
>         $dup++;
>     } else {
>         print GOOD_FILE $_;
>         $good++;
>     }
>     $last_key = $cur_key;
>     $first    = 0;
> }
>
> Then the only thing left is to find out how much time it
> will take to sort a 22-million-row file...
>
> Any comments? Suggestions?
>
> --- "Beau E. Cox" <[EMAIL PROTECTED]> wrote:
> > Hi -
> >
> > > -----Original Message-----
> > > From: Madhu Reddy [mailto:[EMAIL PROTECTED]
> > > Sent: Saturday, February 22, 2003 11:12 AM
> > > To: [EMAIL PROTECTED]
> > > Subject: Out of memory while finding duplicate rows
> > >
> > > Hi,
> > >    I have a script that finds duplicate rows in a file.
> > > The file has 13 million records, and no more than 5% of
> > > them are duplicates.
> > >
> > > To find duplicates I am using the following function:
> > >
> > > while (<FH>) {
> > >     if (find_duplicates()) {
> > >         $dup++;
> > >     }
> > > }
> > >
> > > # returns 1 if the record is a duplicate
> > > # returns 0 if the record is not a duplicate
> > > sub find_duplicates {
> > >     my $key = substr($_, 10, 10);
> > >     if (exists $keys{$key}) {
> > >         $keys{$key}++;
> > >         return 1;    # duplicate row
> > >     } else {
> > >         $keys{$key}++;
> > >         return 0;    # not a duplicate
> > >     }
> > > }
> > > ---------------------------------------------
> > > Here I am storing 13 million keys in a hash...
> > > I think that is why I am running out of memory.
> > >
> > > How do I avoid this?
> > >
> > > Thanx
> > > -Madhu
> >
> > Yeah, Madhu, you are treading on the edge of memory
> > capabilities...
> >
> > You may need to use a database (MySQL comes to mind)
> > and write a key-value table that could accomplish your
> > task as a hash would; then you could handle as many
> > records as your disk space allows.
> >
> > Do you currently have a database installed? If you are
> > running on Windows, even Access would work. Have you used
> > the Perl DBI (CPAN) interface?
> >
> > Just some thoughts...
> >
> > Aloha => Beau;
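
P.S. If sorting 22 million rows turns out to be too slow, there is a middle ground: keep your original one-pass logic, but tie the key hash to an on-disk DB_File so the keys never have to fit in memory. A rough, untested sketch; the file names are made up and the key offset is copied from your script:

use strict;
use warnings;
use Fcntl;      # for O_CREAT and O_RDWR
use DB_File;    # stores the hash on disk instead of in RAM

# 'keys.db' and the data file names are placeholders.
tie my %seen, 'DB_File', 'keys.db', O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "Cannot tie keys.db: $!";

open my $in,   '<', 'records.dat' or die "records.dat: $!";
open my $good, '>', 'good.dat'    or die "good.dat: $!";
open my $dupf, '>', 'dup.dat'     or die "dup.dat: $!";

my ($good_count, $dup_count) = (0, 0);
while (my $line = <$in>) {
    my $key = substr($line, 10, 10);   # same 10-character key as before
    if (exists $seen{$key}) {
        print {$dupf} $line;
        $dup_count++;
    } else {
        $seen{$key} = 1;
        print {$good} $line;
        $good_count++;
    }
}
untie %seen;
print "good: $good_count, duplicates: $dup_count\n";

It will be slower per record than an in-memory hash, but it never runs out of memory, and there is no sort step.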