On Mon, Jun 11, 2012 at 4:31 PM, venkates <venka...@nt.ntnu.no> wrote:

> Hi all,
>
> I am trying to filter files from a directory (code provided below) by
> comparing the contents of each file with a hash ref (a parsed id map file
> provided as an argument). The code works; however, it is extremely slow.
> The .csv files (81 files) that I am reading are not very large (the largest
> file is 183,258 bytes). I would appreciate it if you could suggest
> improvements to the code.
>
> sub filter {
>    my ( $pazar_dir_path, $up_map, $output ) = @_;
>    croak "Not enough arguments! " if ( @_ < 3 );
>
>    my $accepted = 0;
>    my $rejected = 0;
>
>    opendir DH, $pazar_dir_path or croak ("Error in opening directory '$pazar_dir_path': $!");
>    open my $OUT, '>', $output or croak ("Cannot open file for writing '$output': $!");
>    while ( my @data_files = grep(/\.csv$/,readdir(DH)) ) {
>        my @records;
>        foreach my $file ( @data_files ) {
>            open my $FH, '<', "$pazar_dir_path/$file" or croak ("Cannot open file '$file': $!");
>            while ( my $data = <$FH> ) {
>                chomp $data;
>                my $record_output;
>                @records = split /\t/, $data;
>                foreach my $up_acs ( keys %{$up_map} ) {
>                    foreach my $ensemble_id ( @{$up_map->{$up_acs}{'Ensembl_TRS'}} ){
>                        if ( $records[1] eq $ensemble_id ) {
>                            $record_output = join( "\t", @records );
>                            print $OUT "$record_output\n";
>                            $accepted++;
>                        }
>                        else {
>                            $rejected++;
>                            next;
>                        }
>                    }
>                }
>            }
>            close $FH;
>        }
>    }
>    close $OUT;
>    closedir (DH);
>    print "accepted records: $accepted\n, rejected records: $rejected\n";
>    return $output;
> }
>
> __DATA__
>
> TF0000210    ENSMUST00000001326    SP1_MOUSE    GS0000422    ENSMUSG00000037974    7    148974877    149005136    Mus musculus    MUC5AC    14570593    ELECTROPHORETIC MOBILITY SHIFT ASSAY (EMSA)::SUPERSHIFT
> TF0000211    ENSMUST00000066003    SP3_MOUSE    GS0000422    ENSMUSG00000037974    7    148974877    149005136    Mus musculus    MUC5AC    14570593    ELECTROPHORETIC MOBILITY SHIFT ASSAY (EMSA)::SUPERSHIFT
>
>
> Thanks a lot,
>
> Aravind
>
Hi Aravind,

Even though I don't have time to go into the code and really find out why, I
can instantly see one bad thing, and in pseudo code it looks like:
 Loop (1) {
  Loop (2) {
   Loop (3) {
   }
  }
 }
To me that looks like: if I have to loop over 81 files, like you said, and
each of the underlying loops has 81 iterations as well, then I am doing
81x81x81 = 531441 times whatever loop (3) does, 6561 times whatever loop (2)
does, and 81 times whatever loop (1) does.

To make this a lot faster, try to pull the loops apart. You are looping over
a directory: simply get the list and stop that loop, then open each file
separately, and only once you are sure it is a useful file start processing
its contents. Try to limit as much as possible the construct where you have a
loop inside a loop. You also seem to be parsing the files line by line; as
you say they are all very small, you might want to pull each one in in one go
and then treat it as one long string rather than processing it line by line.
A rough sketch of that restructuring is below.
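
Something along these lines, for instance (untested, and only a sketch based
on the names in your code):

    # Read the directory listing once, outside any processing loop.
    opendir my $dh, $pazar_dir_path
        or croak "Error in opening directory '$pazar_dir_path': $!";
    my @data_files = grep { /\.csv$/ } readdir $dh;
    closedir $dh;

    # Handle each file on its own; since the files are small, slurp each one
    # in a single read instead of going line by line.
    foreach my $file (@data_files) {
        open my $fh, '<', "$pazar_dir_path/$file"
            or croak "Cannot open file '$file': $!";
        my $contents = do { local $/; <$fh> };    # read the whole file at once
        close $fh;
        # ... process $contents here ...
    }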

Then there is one more thing I saw and don't quite understand: you take your
hash and compare every key in your hash with each and every value in your
file. Why? Why not simply take the value from your file and see if it has a
matching key in the hash (a lot faster than your way)? As you are only
comparing the two values, you can simply ask the hash whether key X exists;
if it does you can process the record, and if not you reject it instantly. As
a hash is optimized for exactly this kind of lookup, you skip the need to
loop over the entire hash every single time. A sketch of that is below as
well.
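
For example (again untested; %ensembl_lookup is just a name I made up), you
could invert the map once before reading any files, and then replace the two
inner loops with a single exists test per line, keeping your original
line-by-line reading just to show the lookup change:

    # Build a flat lookup of every Ensembl transcript ID in the map, once.
    my %ensembl_lookup;
    foreach my $up_acs ( keys %{$up_map} ) {
        foreach my $ensembl_id ( @{ $up_map->{$up_acs}{'Ensembl_TRS'} } ) {
            $ensembl_lookup{$ensembl_id} = 1;
        }
    }

    # Then, inside the per-line loop, one hash test replaces both inner loops.
    while ( my $data = <$FH> ) {
        chomp $data;
        my @records = split /\t/, $data;
        if ( exists $ensembl_lookup{ $records[1] } ) {
            print $OUT join( "\t", @records ), "\n";
            $accepted++;
        }
        else {
            $rejected++;
        }
    }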

In very short: if you want speed, writing a loop in a loop in a loop in a
loop is never a good idea. If you really have to, a loop in a loop is
acceptable, but more nesting than that should be avoided as much as possible.

Regards,

Rob Coops
