Madhu Reddy wrote:

>   C1     C2     C3   C4
>  ------------------------
>  12345 efghij klmno pqrs
>  34567 abnerv oiuuy uyrv
>  94567 abnerv gtuuy hyrv
>  12345 aswrfr rtyyt erer
>  94567 abnerv gtuuy hyrv
> 
> 
> Here row 1 and row 4 are duplicates...those need
> to be removed or moved to another file
> 

a db is the way to go, but if you need to clean the data before letting it 
enter the db, you could do a little trick:

#!/usr/bin/perl -w
use strict;

my $pre  = undef;               # key (first column) of the previous row
my $data = "/tmp/data.file";

# sort the file in place, numerically on the first column, so that
# rows with the same key end up on adjacent lines
system("sort -n -k 1,1 -o $data $data") && die $?;

open(DATA, $data) || die $!;
while (<DATA>) {
        next unless /^(\d+)/;   # grab the key; skip lines without one
        print if (!defined($pre) || $pre != $1);   # keep only the first row per key
        $pre = $1;
}
close(DATA);

__END__

this should remove the dups and clean the data for you; the kept rows go to 
STDOUT, so redirect that to wherever you want the cleaned copy. the script 
assumes you are running *nix and have the sort utility, and that for each set 
of dups the first entry is kept and the rest are discarded. i decided not to 
use Perl's sort function mainly because, for 22m rows, it could be slow. it 
also assumes the key is the numeric value in the first column and that the 
columns in your data file are separated by whitespace.
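
since you mentioned the dups could also be moved to another file instead of 
just dropped, here is a rough variation on the same loop that writes them out 
separately (the /tmp/dups.file path is just an example, adjust as needed):

#!/usr/bin/perl -w
use strict;

my $pre  = undef;
my $data = "/tmp/data.file";    # input, sorted in place below
my $dups = "/tmp/dups.file";    # example path for the removed rows

system("sort -n -k 1,1 -o $data $data") && die $?;

open(DATA, $data)    || die $!;
open(DUPS, ">$dups") || die $!;
while (<DATA>) {
        next unless /^(\d+)/;
        if (!defined($pre) || $pre != $1) {
                print;                  # first row for this key: keep it
        } else {
                print DUPS $_;          # later rows with the same key: move them
        }
        $pre = $1;
}
close(DATA);
close(DUPS);

__END__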

david
