Hello,

Use bioperl (http://www.bioperl.org/wiki/Main_Page) for this task.

This should do what you want:

#!/usr/bin/perl

use strict;
use warnings;
use Bio::SeqIO;

my $fastaFile = 'myfile';
my $pattern = 'CTTGGCGAGAAGGGCCGCTACCTGCTGGCCGCCTCCTTCGGCAACGT';
my $blockThreshold = '500';
my $numFasta = '0';
my $match = '0';

my $q = Bio::SeqIO->new(-format=>'fasta', -file=>$fastaFile, -alphabet=>'dna');
while (my $seq = $q->next_seq()){
   ++$numFasta;
   if ($seq->seq() =~ /$pattern/){
      ++$match;
   }
}
$q->close();

print "Max number of blocks to search: $blockThreshold\n";
print "Number of blocks found in this file: $numFasta\n";
print "Total matches in $blockThreshold blocks: $match\n";

__END__

Dave

On Fri, 24 Oct 2008 03:38:33 +1000, minky arora <[EMAIL PROTECTED]> wrote:

Dear Gurus,

I have a file of the follwoing form

FFM50HR02GMY4E length=75 xy=2604_3772 region=2 run=R_2008_08_19_08_32_31_

GGGGTCAATGGGTCCGACGGAGAAAGCGCGACAGAGGGGAAAGCCCTTTCCCCTCCCCGT

TCGACTAGCGTCGTG

FFM50HR02F5QTS length=59 xy=2408_2686 region=2 run=R_2008_08_19_08_32_31_

AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG





It has over 5000 such blocks, each starting with ">". I need to search for a
given pattern (String of characters) in the second line of each block and
then print the block header (>FFM50HR02F5QTS). I only need to parse the
first 500 blocks of each file. Of these 500 blocks, I then need to output
the number of times the pattern has occured. My code is below. I didn't
think I has missed anythign till I manually went into each file to compare
the results, which don't match. Can someone point me to whats going wrong
here?







#!/usr/bin/perl

$file_to_parse = "/home/myfile";
$pattern = "CTTGGCGAGAAGGGCCGCTACCTGCTGGCCGCCTCCTTCGGCAACGT";
#$pattern = "abc";

$max_blocks = 500;

# open the data file
open (DAT, "$file_to_parse") || die ("Cannot open file: $file_to_parse");
$match_count = 0;
$block_count = 0;
$block = "";
while (<DAT>){

 chomp (); #remove newline characters

 if ($_ =~ /^>/ && $. > 0){ #beginning of the next block reached

  #look for matches in the current block

  if ($block_count <= $max_blocks){ # check not more than $max_blocks

   $num_matches = () = $block =~ /$pattern/g; #how many matches in this
block
   $match_count += $num_matches; #increase global match coutner

   $block =~ /^(>.+?)\s/g; #get block ID, e.g. >FIFKRKM06HCSVV
   $block_id = $1;

   if ($num_matches > 0){ #output information
print "Block ID: $block_id\nBlock #: $block_count\nNumber of matches in
this block: $num_matches\n\n";
   }

  }

  $block = ""; #empty block holder variable
  $block_count++; #increase block count


 }

 #build the block, concatenate lines
 $block .= $_;

}
close DAT;

print "Max number of blocks to search: $max_blocks\n";
print "Number of blocks found in this file: $block_count\n";
print "Total matches in $max_blocks blocks: $match_count\n\n";

# exit
exit;



--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/


Reply via email to