Madhu Reddy wrote:
> C1 C2 C3 C4
> ------------------------
> 12345 efghij klmno pqrs
> 34567 abnerv oiuuy uyrv
> 94567 abnerv gtuuy hyrv
> 12345 aswrfr rtyyt erer
> 94567 abnerv gtuuy hyrv
>
>
> Here row1 and row4 are duplicates... those need
> to be removed or moved to another file
>
a db is the way to go, but if you need to clean the data before letting it
enter the db, you could use a little trick:
#!/usr/bin/perl -w
use strict;

my $pre;
my $data = "/tmp/data.file";

# sort the file in place, numerically on column 1, so duplicate keys end up adjacent
system("sort -n -k 1,1 -o $data $data") == 0 or die "sort failed: $?";

open(my $fh, '<', $data) or die "can't open $data: $!";
while (<$fh>) {
    # the key is the leading number in column 1; skip lines without one
    next unless /^(\d+)/;
    # print a row only the first time its key is seen
    print if !defined($pre) || $pre != $1;
    $pre = $1;
}
close($fh);
__END__
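if you save this as something like dedup.pl (the name is made up, use whatever you like), you'd redirect stdout to get the cleaned copy, e.g.

perl dedup.pl > /tmp/clean.file

and then load /tmp/clean.file into the db.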
this should remove the dups and clean the data for you. the script assumes
that you are running *nix and have the sort utility, and that for all
dups the first entry is kept and the rest are discarded. i decided not to
use Perl's sort function mainly because, for 22m rows, it could be slow. it
also assumes the columns in your data file are separated by a single space.
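if you'd rather move the dups to another file instead of throwing them away (as the original post mentioned), here is a rough one-pass sketch of the same idea using a hash keyed on column 1. the file names (/tmp/data.file, /tmp/unique.file, /tmp/dups.file) are just placeholders, and keep in mind the %seen hash has to hold every distinct key, which for 22m rows may or may not fit comfortably in memory.

#!/usr/bin/perl -w
use strict;

# sketch only: file names are placeholders, adjust to your setup
my $in   = "/tmp/data.file";
my $uniq = "/tmp/unique.file";
my $dups = "/tmp/dups.file";

open(my $in_fh,   '<', $in)   or die "can't open $in: $!";
open(my $uniq_fh, '>', $uniq) or die "can't open $uniq: $!";
open(my $dups_fh, '>', $dups) or die "can't open $dups: $!";

my %seen;   # keys already written to the unique file
while (<$in_fh>) {
    # column 1 is the key; rows without a numeric key are passed through as unique
    my ($key) = /^(\d+)/;
    if (defined $key && $seen{$key}++) {
        print $dups_fh $_;    # duplicate key: move the row to the dups file
    } else {
        print $uniq_fh $_;    # first time this key is seen: keep the row
    }
}

close($in_fh);
close($uniq_fh);
close($dups_fh);
__END__

the upside over the sort-based version is that it is a single pass and preserves the original row order; the downside is the memory used by %seen.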
david