Hi -

> -----Original Message-----
> From: Madhu Reddy [mailto:[EMAIL PROTECTED]
> Sent: Saturday, February 22, 2003 11:12 AM
> To: [EMAIL PROTECTED]
> Subject: Out of memory while finding duplicate rows
> 
> 
> Hi,
>   I have a script that finds duplicate rows in a file.
> The file has 13 million records, and no more than 5%
> of them are duplicates.
> 
> To find duplicates I am using the following function:
> 
> while (<FH>) {
>     if (find_duplicates()) {
>         $dup++;
>     }
> }
> 
> # Returns 1 if the record is a duplicate, 0 if it is not.
> sub find_duplicates {
>     my $key = substr($_, 10, 10);
>     if ( exists $keys{$key} ) {
>         $keys{$key}++;
>         return 1;    # duplicate row
>     } else {
>         $keys{$key}++;
>         return 0;    # not a duplicate
>     }
> }
> ---------------------------------------------
> Here I am storing 13 million keys in a hash,
> and I think that is why I am running out of memory.
> 
> How can I avoid this?
> 
> Thanx
> -Madhu
> 

Yeah, Madhu, you are treading on the edge of memory
capabilities there...

You may need to use a database (MySQL comes to mind) and
write the keys to a key-value table that does the same job
as your hash; that way you can handle as many records as
your disk space allows.
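
To make that concrete, here is a rough sketch of setting up
such a key-value table through DBI. The database name, host,
credentials, and the table/column names (seen_keys, rec_key)
are placeholders I have made up; the point is just that the
10-character key becomes the primary key, so the database
itself refuses to store it twice:

    use DBI;

    # Connect to a local MySQL database (all connection
    # details below are placeholders).
    my $dbh = DBI->connect(
        "DBI:mysql:database=dupcheck;host=localhost",
        "user", "password",
        { RaiseError => 1 },
    ) or die $DBI::errstr;

    # One row per distinct key; the primary key makes a
    # second insert of the same key fail.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS seen_keys (
            rec_key CHAR(10) NOT NULL PRIMARY KEY
        )
    });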

Do you currently have a database installed? If you
are running on Windows, even Access would work. Have you
used the Perl DBI interface (available from CPAN)?
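
With DBI in place, the counting loop from your script would
look roughly like this, continuing from the sketch above.
Again this is only a sketch: the file name is a placeholder,
and INSERT IGNORE is MySQL-specific. When a key already
exists the insert is silently skipped and execute() reports
0 rows affected, so that is the duplicate case:

    # Prepare once, execute once per record.
    my $sth = $dbh->prepare(
        "INSERT IGNORE INTO seen_keys (rec_key) VALUES (?)"
    );

    my $dup = 0;
    open(FH, "records.txt") or die "Cannot open records.txt: $!";
    while (<FH>) {
        my $key = substr($_, 10, 10);        # same key as in your script
        $dup++ if $sth->execute($key) == 0;  # 0 rows inserted => duplicate
    }
    close FH;

    print "Found $dup duplicate rows\n";
    $dbh->disconnect;

One caveat: a separate INSERT for each of 13 million lines
will be slow, so doing the inserts in batches (or bulk-loading
the file and letting the database count the duplicates) is
worth investigating.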

Just some thoughts...

Aloha => Beau;

