Christiane Nerz wrote:
> 
> Good morning!

Hello,

> Thx for your help..
> 
> The problem seems to lay in "filling the hash".
> But I can't see why.
> 
> I want to compare two fasta-files,  more precisely the IDs of two sets
> of sequences.

I take it that the ID is everything between ' >' and "\n"?


> Each file looks like:
> 
>  >gi|13699918|dbj|BAB41217.1|.....
> MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVG
> YEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLA
> VAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDA......
>  >gi|13699919|dbj|BAB41218.1|....
> QFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQ
> ENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAE....
>  >gi|13699920|dbj|BAB41219.1|.....
> MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDA
> GSFLIIHQGEQ
>  >gi|13699921|dbj|BAB41220.1|...
> MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKI
> EGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELG
> QISAVYLNDLAQYQRILKQKNNYLKQLQLGQ
> 
> The whole code is as follows:
> 
> #!/usr/local/bin/perl
> #
> ###############################################################################
> # add_IDs_by_pattern_matching
> #
> # Input: Two fastafiles.
> # (Some sequences are in both files, but labeled with different IDS)
> # Output: One fasta-file, sequences labeled with both IDs.

Here you seem to say that you want to compare the sequences and not the
IDs?


> ###############################################################################
> 
> use strict;
> my $fastafilename1 = '';
> my $fastafilename2 = '';
> my %hash_fasta1 = '';
> my %hash_fasta2 = '';
> my $key_fasta1 = '';
> my $key_fasta2 = '';
> my $value_fasta1 = '';
> my $value_fasta2 = '';
> my %hash_pattern = '';
> my $key_pattern = '';
> my $value_pattern = '';
> 
> print "Please type the name of the first file:\n";
> chomp ($fastafilename1 = <STDIN>);
> open(FASTAFILE1, $fastafilename1) ||
>          die("Cannot open file for reading: $!");
> 
> #read in the first file
> #and filling two hashes
> #values in hash_pattern later used for pattern-matching
> 
> my $k = 0;
> my $i = 0;
> while (<FASTAFILE1>) {
>      if(/^>/) {

Isn't there a space before the '>' character?

>          chomp;
>          $i =1;
>          if ($k==1) {
>              chomp;

You have already chomped the line, there is no point in chomping it
again.

>              $hash_fasta1{$key_fasta1} = $value_fasta1;
>              $value_fasta1 ='';
>          }
>          else {
>          $key_fasta1 = $_;
>          $key_pattern = $_;
>          }
>      }
>      else {
>          if ($i==0)  {
>                 chomp;

You have already chomped the line, there is no point in chomping it
again.

>                 $value_fasta1 = $value_fasta1 . $_;
>                 $k=1;
>          }
>          else   {
>          $i = 0;
>          chomp;

You have already chomped the line, there is no point in chomping it
again.

>          $hash_pattern{$key_pattern} = $_;
>          $value_fasta1 = $value_fasta1 . $_;
>          $k=1;
>          }
>      }
> }
> 
> $hash_fasta1{$key_fasta1} = $value_fasta1;
> delete $hash_fasta1 {''};
> delete $hash_pattern {''};

You should probably avoid adding that key in the first place instead of
deleting it later.

> close(FASTAFILE1) || die("Can't close in file: $!") ;
> 
> #read in the second file
> 
> print "Please type the name of the first file:\n\n";

Don't you mean the second file?  :-)

> chomp ($fastafilename2 = <STDIN>);
> open(FASTAFILE2, $fastafilename2) ||
>          die("Cannot open file for reading: $!");
> 
> my $j = 0;
> while (<FASTAFILE2>) {
>      if(/^>/) {
>          chomp;
>          if ($j==1) {
>              chomp;
>              $hash_fasta2{$key_fasta2} = $value_fasta2;
>              $value_fasta2 ='';
>              }
>          else {
>          $key_fasta2 = $_;
>          }
>      } else {
>          chomp;
>          $value_fasta2 = $value_fasta2 . $_;
>          $j=1;
>      }
> }
> 
> $hash_fasta2{$key_fasta2} = $value_fasta2;
> delete $hash_fasta2 {''};
> close(FASTAFILE2) || die("Can't close in file: $!") ;
> 
> my $outputfile = '';
> 
> #open outputfile
> $outputfile = "both_IDs";
> unless (open(BOTH_IDS, ">$outputfile") ) {
> print "Cannot open file \"$outputfile\" to write to!!\n\n";
> exit;
> }
> 
> my @array1 = keys %hash_fasta1;
> my @array2 = keys %hash_fasta2;
> 
> ##################################################################
> # Because I only found one sequence, which is in both fasta-files,
> # I tried to find out, if the hashes are correctly filled.
> # So here I put the code as described** and got the different
> # output. Rest of code:
> 
> my $key_hash1 = '';
> my $key_hash2 = '';
> 
> #pattern-matching
> 
> foreach (@array1) {
>     $key_hash1 = $_;

foreach $key_hash1 (@array1) {

>      foreach (@array2) {
>          $key_hash2 = $_;

     foreach $key_hash2 (@array2) {

>          if ($hash_fasta2{$key_hash2}  =~ $hash_pattern{$key_hash1}) {

You could use index() to do that and you wouldn't have to worry about
regex meta-characters.

>             print "Pattern $hash_pattern{$key_hash1} is in
> $hash_fasta2{$key_hash2} \n";
>             print BOTH_IDS  $key_hash1 . $key_hash2 .
> $hash_fasta2{$key_hash2};
>         }
>     }
> }
> 
> #close outputfile
> close (BOTH_IDS)  || die("Can't close in file: $!");


If you want to associate all sequences with their corresponding IDs then
use a hash of arrays:

#!/usr/local/bin/perl
use warnings;
use strict;

print "Please type the name of the first file:\n";
chomp( @ARGV = scalar <STDIN> );

print "Please type the name of the second file:\n\n";
chomp( $ARGV[ 1 ] = <STDIN> );

my %hash_fasta;
$/ = ' >';

while ( <> ) {
    my ( $id, $seq ) = split( /\n/, $_, 2 ) or next;
    $seq =~ tr/A-Z//cd;  # remove non-sequence characters
    push @{ $hash_fasta{ $id } }, $seq;
    }

for my $id ( keys %hash_fasta ) {
    print " >$id\n";
    print "$_\n" for @{ $hash_fasta{ $id } };
    }

__END__


John
-- 
use Perl;
program
fulfillment

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to