On Mon, Jun 11, 2012 at 4:31 PM, venkates <venka...@nt.ntnu.no> wrote:
> Hi all,
>
> I am trying to filter files from a directory (code provided below) by
> comparing the contents of each file with a hash ref (a parsed id map
> file provided as an argument). The code is working; however, it is
> extremely slow. The .csv files (81 files) that I am reading are not
> very large (the largest file is 183,258 bytes). I would appreciate it
> if you could suggest improvements to the code.
>
> sub filter {
>     my ( $pazar_dir_path, $up_map, $output ) = @_;
>     croak "Not enough arguments! " if ( @_ < 3 );
>
>     my $accepted = 0;
>     my $rejected = 0;
>
>     opendir DH, $pazar_dir_path
>         or croak("Error in opening directory '$pazar_dir_path': $!");
>     open my $OUT, '>', $output
>         or croak("Cannot open file for writing '$output': $!");
>     while ( my @data_files = grep( /\.csv$/, readdir(DH) ) ) {
>         my @records;
>         foreach my $file (@data_files) {
>             open my $FH, '<', "$pazar_dir_path/$file"
>                 or croak("Cannot open file '$file': $!");
>             while ( my $data = <$FH> ) {
>                 chomp $data;
>                 my $record_output;
>                 @records = split /\t/, $data;
>                 foreach my $up_acs ( keys %{$up_map} ) {
>                     foreach my $ensemble_id ( @{ $up_map->{$up_acs}{'Ensembl_TRS'} } ) {
>                         if ( $records[1] eq $ensemble_id ) {
>                             $record_output = join( "\t", @records );
>                             print $OUT "$record_output\n";
>                             $accepted++;
>                         }
>                         else {
>                             $rejected++;
>                             next;
>                         }
>                     }
>                 }
>             }
>             close $FH;
>         }
>     }
>     close $OUT;
>     closedir(DH);
>     print "accepted records: $accepted\n, rejected records: $rejected\n";
>     return $output;
> }
>
> __DATA__
>
> TF0000210  ENSMUST00000001326  SP1_MOUSE  GS0000422  ENSMUSG00000037974  7  148974877  149005136  Mus musculus  MUC5AC  14570593  ELECTROPHORETIC MOBILITY SHIFT ASSAY (EMSA)::SUPERSHIFT
> TF0000211  ENSMUST00000066003  SP3_MOUSE  GS0000422  ENSMUSG00000037974  7  148974877  149005136  Mus musculus  MUC5AC  14570593  ELECTROPHORETIC MOBILITY SHIFT ASSAY (EMSA)::SUPERSHIFT
>
> Thanks a lot,
>
> Aravind

Hi Aravind,

Even though I don't have time to go into the code and really find out why, I can instantly see a bad thing, and in pseudo code it looks like:

    Loop (1) {
        Loop (2) {
            Loop (3) {
            }
        }
    }

To me that looks like: if I have to loop over 81 files like you said, and each of the underlying loops runs 81 times as well, then I am doing 81 x 81 x 81 = 531,441 times whatever loop (3) does, 6,561 times whatever loop (2) does, and 81 times whatever loop (1) does. To make this a lot faster, try to pull the loops apart. You are looping over a directory: simply get the file list once and stop that loop, then open each file separately, and only once you are sure it is a useful file start processing its contents. Try to limit as much as possible the construct where you have a loop inside a loop.

You also seem to be parsing the files line by line; since, as you say, they are all very small, you might want to pull each one in in one go and then treat it as one long string rather than processing it line by line.

Then there is one more thing I saw and don't quite understand: you take your hash and compare every key in your hash with each and every value in your file. Why? Why not simply take the value from your file and see if it has a matching key in the hash (a lot faster than your way)? As you are only comparing the two values, you can simply ask the hash if key X exists; if it does you can process the record, and if not you reject it instantly. Something like the sketch below should do it.
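Roughly what I mean, as an untested sketch: build a reverse lookup hash once, read the directory listing once, and then do a single exists() per record. The name %is_wanted is mine, and I am assuming your id map really is shaped like $up_map->{accession}{'Ensembl_TRS'} = [ ids ], as your code suggests.

    use strict;
    use warnings;
    use Carp;

    sub filter {
        my ( $pazar_dir_path, $up_map, $output ) = @_;
        croak "Not enough arguments!" if @_ < 3;

        # Build the reverse lookup ONCE: every wanted Ensembl id
        # becomes a key, so each record later needs one exists()
        # instead of a scan over every key of %$up_map.
        my %is_wanted;
        for my $up_acs ( keys %{$up_map} ) {
            $is_wanted{$_} = 1 for @{ $up_map->{$up_acs}{'Ensembl_TRS'} };
        }

        # Read the directory listing once, outside of any loop.
        opendir my $dh, $pazar_dir_path
            or croak "Error in opening directory '$pazar_dir_path': $!";
        my @data_files = grep { /\.csv$/ } readdir $dh;
        closedir $dh;

        open my $out, '>', $output
            or croak "Cannot open file for writing '$output': $!";

        my ( $accepted, $rejected ) = ( 0, 0 );
        for my $file (@data_files) {
            open my $fh, '<', "$pazar_dir_path/$file"
                or croak "Cannot open file '$file': $!";
            while ( my $data = <$fh> ) {
                chomp $data;
                my @records = split /\t/, $data;
                # One hash lookup instead of two nested foreach loops.
                if ( defined $records[1] && exists $is_wanted{ $records[1] } ) {
                    print {$out} join( "\t", @records ), "\n";
                    $accepted++;
                }
                else {
                    $rejected++;
                }
            }
            close $fh;
        }

        close $out;
        print "accepted records: $accepted, rejected records: $rejected\n";
        return $output;
    }

This also counts each record exactly once; your version increments $rejected once for every id it compares against, so your accepted/rejected totals will not add up to the number of records read. (And if you do want to slurp a file whole, localizing $/ to undef before reading is all it takes; at a couple of hundred kilobytes per file it will hardly matter.)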
As a hash is optimized for exactly this kind of lookup, you skip the need to loop over the entire hash every single time. In very short: if you want speed, writing a loop in a loop in a loop in a loop is never a good idea. If you have to, a loop inside a loop is acceptable; more than that should be avoided as much as possible.

Regards,

Rob Coops