Hi, this data ultimately has to be loaded into a database. Before loading it into the database, we need to do some validation, such as removing duplicates. That is why I am doing this.
I have another idea. What I am planning is to sort the file first, on the same 10-character key at offset 10. Once the file is sorted, records with the same key are next to each other, so duplicates are easy to find:

    my $last_key;    # key of the previous record; undef before the first one
    while (my $line = <FH>) {
        my $cur_key = substr($line, 10, 10);
        # Keys are strings, so compare with eq rather than ==.
        if (defined $last_key && $cur_key eq $last_key) {
            print dup_file $line;     # same key as the previous record
            $dup++;
        } else {
            print good_file $line;    # first record with this key
            $good++;
        }
        $last_key = $cur_key;
    }

The only thing left is to find out how long it will take to sort a file with 22 million rows.

Any comments? Suggestions?

--- "Beau E. Cox" <[EMAIL PROTECTED]> wrote:
> Hi -
>
> > -----Original Message-----
> > From: Madhu Reddy [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, February 22, 2003 11:12 AM
> > To: [EMAIL PROTECTED]
> > Subject: Out of memory while finding duplicate rows
> >
> > Hi,
> > I have a script that will find out duplicate rows
> > in a file...in a file i have 13 millions of
> > records....
> > out of that not more than 5% are duplicate....
> >
> > for finding duplicate i am using following function...
> >
> > while (<FH>) {
> >     if (find_duplicates ()) {
> >         $dup++
> >     }
> > }
> >
> > # return 1, if record is duplicate
> > #returns 0, if record is not duplicate
> > sub find_duplicates ()
> > {
> >     $key = substr($_,10,10);
> >     if ( exists $keys{$key} ) {
> >         $keys{$key}++;
> >         return 1; #duplicate row
> >     } else {
> >         $keys{$key}++;
> >         return 0; #not a duplicate
> >     }
> > }
> > ---------------------------------------------
> > here i am storing 13 millions into hash...
> > I think that is why i am getting out of memory.....
> >
> > how to avoid this ?
> >
> > Thanx
> > -Madhu
> >
>
> Yeah, Madhu, you are treading on the edge of memory
> capabilities...
>
> You may need to use a database (MySQL comes to mind),
> and write a key-value table that could accomplish your
> task as a hash would, and you could handle as many records
> as your disk space allows.
>
> Do you currently have a database installed?
> If you are running on Windows, even Access would work. Have you
> used the perl DBI (CPAN) interface?
>
> Just some thoughts...
>
> Aloha => Beau;
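
As a small illustration of the sort-then-scan idea in this message, here is a self-contained sketch in Perl. The helper name split_duplicates and the sample records are made up for illustration and are not part of the original script. It sorts in memory, which is only reasonable for small inputs; for a file of 22 million rows the in-memory sort would presumably be replaced by an external sort (for example the Unix sort utility), with the scan reading the sorted file line by line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): sort records on the
# 10-character key at offset 10, then split them into "good" and "dup".
# Assumes every record is at least 20 characters long.
sub split_duplicates {
    my @records = @_;

    # Sort on the key as a string, so records with equal keys become adjacent.
    my @sorted = sort { substr($a, 10, 10) cmp substr($b, 10, 10) } @records;

    my (@good, @dup);
    my $last_key;    # undef until the first record has been seen
    for my $rec (@sorted) {
        my $key = substr($rec, 10, 10);
        if (defined $last_key && $key eq $last_key) {
            push @dup, $rec;     # same key as the previous record
        } else {
            push @good, $rec;    # first record seen with this key
        }
        $last_key = $key;
    }
    return (\@good, \@dup);
}

# Made-up sample records: 10 filler characters, then the 10-character key.
my @sample = (
    "AAAAAAAAAA0000000002 second\n",
    "BBBBBBBBBB0000000001 first\n",
    "CCCCCCCCCC0000000002 repeat\n",
);
my ($good, $dup) = split_duplicates(@sample);
printf "good=%d dup=%d\n", scalar @$good, scalar @$dup;    # prints good=2 dup=1
```

Only the key of the previous record is kept between iterations, so the scan itself needs almost no memory, unlike the hash of 13 million keys. The cost moves into the sort step, which is the part that would need to be timed.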