Hello,
Use bioperl (http://www.bioperl.org/wiki/Main_Page) for this task.
This should do what you want:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $fastaFile = 'myfile';
my $pattern = 'CTTGGCGAGAAGGGCCGCTACCTGCTGGCCGCCTCCTTCGGCAACGT';
my $blockThreshold = '500';
my $numFasta = '0';
my $match = '0';
my $q = Bio::SeqIO->new(-format=>'fasta', -file=>$fastaFile,
-alphabet=>'dna');
while (my $seq = $q->next_seq()){
++$numFasta;
if ($seq->seq() =~ /$pattern/){
++$match;
}
}
$q->close();
print "Max number of blocks to search: $blockThreshold\n";
print "Number of blocks found in this file: $numFasta\n";
print "Total matches in $blockThreshold blocks: $match\n";
__END__
Dave
On Fri, 24 Oct 2008 03:38:33 +1000, minky arora <[EMAIL PROTECTED]>
wrote:
Dear Gurus,
I have a file of the follwoing form
FFM50HR02GMY4E length=75 xy=2604_3772 region=2
run=R_2008_08_19_08_32_31_
GGGGTCAATGGGTCCGACGGAGAAAGCGCGACAGAGGGGAAAGCCCTTTCCCCTCCCCGT
TCGACTAGCGTCGTG
FFM50HR02F5QTS length=59 xy=2408_2686 region=2
run=R_2008_08_19_08_32_31_
AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG
It has over 5000 such blocks, each starting with ">". I need to search
for a
given pattern (String of characters) in the second line of each block and
then print the block header (>FFM50HR02F5QTS). I only need to parse the
first 500 blocks of each file. Of these 500 blocks, I then need to output
the number of times the pattern has occured. My code is below. I didn't
think I has missed anythign till I manually went into each file to
compare
the results, which don't match. Can someone point me to whats going wrong
here?
#!/usr/bin/perl
$file_to_parse = "/home/myfile";
$pattern = "CTTGGCGAGAAGGGCCGCTACCTGCTGGCCGCCTCCTTCGGCAACGT";
#$pattern = "abc";
$max_blocks = 500;
# open the data file
open (DAT, "$file_to_parse") || die ("Cannot open file: $file_to_parse");
$match_count = 0;
$block_count = 0;
$block = "";
while (<DAT>){
chomp (); #remove newline characters
if ($_ =~ /^>/ && $. > 0){ #beginning of the next block reached
#look for matches in the current block
if ($block_count <= $max_blocks){ # check not more than $max_blocks
$num_matches = () = $block =~ /$pattern/g; #how many matches in this
block
$match_count += $num_matches; #increase global match coutner
$block =~ /^(>.+?)\s/g; #get block ID, e.g. >FIFKRKM06HCSVV
$block_id = $1;
if ($num_matches > 0){ #output information
print "Block ID: $block_id\nBlock #: $block_count\nNumber of matches
in
this block: $num_matches\n\n";
}
}
$block = ""; #empty block holder variable
$block_count++; #increase block count
}
#build the block, concatenate lines
$block .= $_;
}
close DAT;
print "Max number of blocks to search: $max_blocks\n";
print "Number of blocks found in this file: $block_count\n";
print "Total matches in $max_blocks blocks: $match_count\n\n";
# exit
exit;
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/