Hi David,

How are you? I am using your program for sorting (it is quoted below, at the end of this mail). I am sorting a 7.5 GB file with it; the file has 13 million records.

I changed your program as follows:

    if(@buffer > 500000){
        my $tmp = "tmp" . $counter++ . ".txt";
        .....
    }

Here are the statistics: it took 5 hours 30 minutes to sort the 13-million-record file on a machine with 8 CPUs and 8 GB of RAM.

How can I improve the speed? I think 5 hours is too long.

Thanks,
-Madhu

--- david <[EMAIL PROTECTED]> wrote:
> Madhu Reddy wrote:
>
> > Hi,
> > I want to sort a file and write the result back to the same file.
> > I want to sort it based on the 3rd column.
> >
> > following is my file format
> >
> > C1    C2    C3     C4
> > 1234  guhr  89890  uierfer
> > 1324  guii  60977  hiofver
> > 5467  frwf  56576  errtttt
> >
> > i want to sort the above file based on column 3 (C3)
> > and write the sorted result to the same file.
> >
> > After sorting, my file should be
> >
> > 5467  frwf  56576  errtttt
> > 1324  guii  60977  hiofver
> > 1234  guhr  89890  uierfer
> >
> > how to do this?
> > the file may have around 20 million rows.
>
> if you are using the *nix os, you should try the sort utility. if you are
> not using *nix and you don't have the sort utility, you will have to rely
> on Perl's sort function. with 20m rows, you probably don't want to store
> everything in memory and then sort them. what you have to do is sort the
> data file segment by segment and then merge the segments back together.
> merging is the really tricky business. the following script (which i did
> for someone a while ago) will do that for you. what it does is break the
> file into multiple chunks of 100000 lines, sort each chunk into a tmp file
> on disk and then merge all the chunks back together. when i sort a chunk,
> i keep its smallest key (its lower boundary) and use these numbers to order
> the tmp files for merging, so you don't have to compare all the tmp files
> at once.
>
> #!/usr/bin/perl -w
> use strict;
>
> my @buffer  = ();   # lines of the chunk currently being read
> my @tmps    = ();   # names of the sorted chunk files
> my %bounds  = ();   # chunk file name => smallest key in that chunk
> my $counter = 0;
>
> open(FILE,"file.txt") || die $!;
> while(<FILE>){
>     push(@buffer,$_);
>     if(@buffer > 100000){
>         my $tmp = "tmp" . $counter++ . ".txt";
>         push(@tmps,$tmp);
>         sort_it(\@buffer,$tmp);
>         @buffer = ();
>     }
> }
> close(FILE);
>
> # sort whatever is left over from the last, partial chunk
> if(@buffer){
>     my $tmp = "tmp" . $counter++ . ".txt";
>     push(@tmps,$tmp);
>     sort_it(\@buffer,$tmp);
> }
>
> merge_it(\%bounds);
> unlink(@tmps);
>
> #-- DONE --#
>
> # sort one chunk numerically on column 3 and write it to $tmp
> sub sort_it{
>     my $ref = shift;
>     my $tmp = shift;
>     my $first = 1;
>     open(TMP,">$tmp") || die $!;
>     for(sort {my @fields1 = split(/\s/,$a);
>               my @fields2 = split(/\s/,$b);
>               $fields1[2] <=> $fields2[2] } @{$ref}){
>         if($first){
>             # remember the smallest key of this chunk
>             $bounds{$tmp} = (split(/\s/))[2];
>             $first = 0;
>         }
>         print TMP $_;
>     }
>     close(TMP);
> }
>
> # merge the chunk files one at a time into merged_tmp<number>.txt
> sub merge_it{
>     my $ref = shift;
>     my @files = sort {$ref->{$a} <=> $ref->{$b}} keys %{$ref};
>     my $merged_to = $files[0];
>     for(my $i=1; $i<@files; $i++){
>         open(FIRST,$merged_to) || die $!;
>         open(SECOND,$files[$i]) || die $!;
>         my $merged_tmp = "merged_tmp$i.txt";
>         open(MERGED,">$merged_tmp") || die $!;
>         my $line1 = <FIRST>;
>         my $line2 = <SECOND>;
>         while(1){
>             # one input is exhausted: copy the rest of the other
>             if(!defined($line1) && defined($line2)){
>                 print MERGED $line2;
>                 print MERGED while(<SECOND>);
>                 last;
>             }
>             if(!defined($line2) && defined($line1)){
>                 print MERGED $line1;
>                 print MERGED while(<FIRST>);
>                 last;
>             }
>             last if(!defined($line1) && !defined($line2));
>             my $value1 = (split(/\s/,$line1))[2];
>             my $value2 = (split(/\s/,$line2))[2];
>             if($value1 == $value2){
>                 print MERGED $line1;
>                 print MERGED $line2;
>                 $line1 = <FIRST>;
>                 $line2 = <SECOND>;
>             }elsif($value1 > $value2){
>                 while($value1 > $value2){
>                     print MERGED $line2;
>                     $line2 = <SECOND>;
>                     last unless(defined $line2);
>                     $value2 = (split(/\s/,$line2))[2];
>                 }
>             }else{
>                 while($value1 < $value2){
>                     print MERGED $line1;
>                     $line1 = <FIRST>;
>                     last unless(defined $line1);
>                     $value1 = (split(/\s/,$line1))[2];
>                 }
>             }
>         }
>         close(FIRST);
>         close(SECOND);
>         close(MERGED);
>         $merged_to = $merged_tmp;
>     }
> }
>
> __END__
>
> after the script finishes, you will notice some files named
> merged_tmp<number>.txt. if you look at merged_tmp<largest number>.txt,
> you should see that your original file is sorted in this file. i decided
> not to delete those merged_tmp files so you can see exactly how each chunk
> is merged in, one by one. great for debugging. i omitted a lot of error
> checks which you should add if you decide to use the script. it can sort
> extremely large files without using a lot of memory but it does use up
> your disk space and it isn't very fast. finally, if you find the script
> not working, please let me know so i can fix it.
>
> david
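
For what it's worth, two things stand out when reading the script above: sort_it splits both lines again inside every comparison, and merge_it merges one chunk at a time, so the growing merged file is re-read and re-written once for every remaining chunk. Below is a minimal sketch of the same chunk-and-merge idea with those two points changed. It is not david's script and has not been timed on a 7.5 GB file. The names sort_chunk and merge_chunks are made up for illustration, the third column is assumed to be numeric, and split ' ' is used, which (unlike split(/\s/)) treats a run of whitespace as one separator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort one chunk of lines numerically on the 3rd whitespace-separated column.
# The key is extracted once per line (decorate, sort, undecorate) instead of
# splitting both lines inside every comparison.
sub sort_chunk {
    my ($lines) = @_;
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map  { [ (split ' ', $_)[2], $_ ] } @$lines;
}

# Merge any number of already-sorted chunk files into $out in a single pass.
# Only one pending line per chunk is held in memory.
sub merge_chunks {
    my ($out, @files) = @_;
    my @heads;    # each entry: [ key, line, filehandle ]
    for my $file (@files) {
        open(my $fh, '<', $file) or die "$file: $!";
        my $line = <$fh>;
        push @heads, [ (split ' ', $line)[2], $line, $fh ] if defined $line;
    }
    open(my $out_fh, '>', $out) or die "$out: $!";
    while (@heads) {
        # Linear scan for the smallest pending key.
        my $min = 0;
        for my $i (1 .. $#heads) {
            $min = $i if $heads[$i][0] < $heads[$min][0];
        }
        my (undef, $line, $fh) = @{ $heads[$min] };
        print {$out_fh} $line;
        my $next = <$fh>;
        if (defined $next) {
            # Refill this slot with the next line of the same chunk.
            $heads[$min] = [ (split ' ', $next)[2], $next, $fh ];
        } else {
            # This chunk is exhausted; drop it from the merge.
            close($fh);
            splice(@heads, $min, 1);
        }
    }
    close($out_fh) or die "$out: $!";
}
```

To try it, the output of sort_chunk(\@buffer) could be printed to each tmp file in place of the sort block in sort_it, and merge_chunks("sorted.txt", @tmps) could be called in place of merge_it. The linear scan over the pending lines is a deliberate simplification that should be fine for a few dozen chunks; with hundreds of chunks a heap would be the usual choice.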