Christiane Nerz wrote: > > Good morning! Hello,
> Thx for your help.. > > The problem seems to lay in "filling the hash". > But I can't see why. > > I want to compare two fasta-files, more precisely the IDs of two sets > of sequences. I take it that the ID is everything between ' >' and "\n"? > Each file looks like: > > >gi|13699918|dbj|BAB41217.1|..... > MSEKEIWEKVLEIAQEKLSAVSYSTFLKDTELYTIKDGEAIVLSSIPFNANWLNQQYAEIIQAILFDVVG > YEVKPHFITTEELANYSNNETATPKETTKPSTETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLA > VAEAPAKAYNPLFIYGGVGLGKTHLMHAIGHHVLDNNPDA...... > >gi|13699919|dbj|BAB41218.1|.... > QFQTLITSGHSEFNLSGLDPDQYPLLPQVSRDDAIQLSVKVLKNVIAQTNFAVSTSETRPVLTGVNWLIQ > ENELICTATDSHRLAVRKLQLEDVSENKNVIIPGKALAE.... > >gi|13699920|dbj|BAB41219.1|..... > MIILVQEVVVEGDINLGQFLKTEGIIESGGQAKWFLQDVEVLINGVRETRRGKKLEHQDRIDIPELPEDA > GSFLIIHQGEQ > >gi|13699921|dbj|BAB41220.1|... > MKLNTLQLENYRNYDEVTLKCHPDVNILIGENAQGKTNLLESIYTLALAKSHRTSNDKELIRFNADYAKI > EGELSYRHGTMPLTMFITKKGKQVKVNHLEQSRLTQYIGHLNVVLFAPEDLNIVKGSPQIRRRFIDMELG > QISAVYLNDLAQYQRILKQKNNYLKQLQLGQ > > The whole code is as follows: > > #!/usr/local/bin/perl > # > ############################################################################### > # add_IDs_by_pattern_matching > # > # Input: Two fastafiles. > # (Some sequences are in both files, but labeled with different IDS) > # Output: One fasta-file, sequences labeled with both IDs. Here you seem to say that you want to compare the sequences and not the IDs? > ############################################################################### > > use strict; > my $fastafilename1 = ''; > my $fastafilename2 = ''; > my %hash_fasta1 = ''; > my %hash_fasta2 = ''; > my $key_fasta1 = ''; > my $key_fasta2 = ''; > my $value_fasta1 = ''; > my $value_fasta2 = ''; > my %hash_pattern = ''; > my $key_pattern = ''; > my $value_pattern = ''; > > print "Please type the name of the first file:\n"; > chomp ($fastafilename1 = <STDIN>); > open(FASTAFILE1, $fastafilename1) || > die("Cannot open file for reading: $!"); > > #read in the first file > #and filling two hashes > #values in hash_pattern later used for pattern-matching > > my $k = 0; > my $i = 0; > while (<FASTAFILE1>) { > if(/^>/) { Isn't there a space before the '>' character? > chomp; > $i =1; > if ($k==1) { > chomp; You have already chomped the line, there is no point in chomping it again. > $hash_fasta1{$key_fasta1} = $value_fasta1; > $value_fasta1 =''; > } > else { > $key_fasta1 = $_; > $key_pattern = $_; > } > } > else { > if ($i==0) { > chomp; You have already chomped the line, there is no point in chomping it again. > $value_fasta1 = $value_fasta1 . $_; > $k=1; > } > else { > $i = 0; > chomp; You have already chomped the line, there is no point in chomping it again. > $hash_pattern{$key_pattern} = $_; > $value_fasta1 = $value_fasta1 . $_; > $k=1; > } > } > } > > $hash_fasta1{$key_fasta1} = $value_fasta1; > delete $hash_fasta1 {''}; > delete $hash_pattern {''}; You should probably avoid adding that key in the first place instead of deleting it later. > close(FASTAFILE1) || die("Can't close in file: $!") ; > > #read in the second file > > print "Please type the name of the first file:\n\n"; Don't you mean the second file? :-) > chomp ($fastafilename2 = <STDIN>); > open(FASTAFILE2, $fastafilename2) || > die("Cannot open file for reading: $!"); > > my $j = 0; > while (<FASTAFILE2>) { > if(/^>/) { > chomp; > if ($j==1) { > chomp; > $hash_fasta2{$key_fasta2} = $value_fasta2; > $value_fasta2 =''; > } > else { > $key_fasta2 = $_; > } > } else { > chomp; > $value_fasta2 = $value_fasta2 . $_; > $j=1; > } > } > > $hash_fasta2{$key_fasta2} = $value_fasta2; > delete $hash_fasta2 {''}; > close(FASTAFILE2) || die("Can't close in file: $!") ; > > my $outputfile = ''; > > #open outputfile > $outputfile = "both_IDs"; > unless (open(BOTH_IDS, ">$outputfile") ) { > print "Cannot open file \"$outputfile\" to write to!!\n\n"; > exit; > } > > my @array1 = keys %hash_fasta1; > my @array2 = keys %hash_fasta2; > > ################################################################## > # Because I only found one sequence, which is in both fasta-files, > # I tried to find out, if the hashes are correctly filled. > # So here I put the code as described** and got the different > # output. Rest of code: > > my $key_hash1 = ''; > my $key_hash2 = ''; > > #pattern-matching > > foreach (@array1) { > $key_hash1 = $_; foreach $key_hash1 (@array1) { > foreach (@array2) { > $key_hash2 = $_; foreach $key_hash2 (@array2) { > if ($hash_fasta2{$key_hash2} =~ $hash_pattern{$key_hash1}) { You could use index() to do that and you wouldn't have to worry about regex meta-characters. > print "Pattern $hash_pattern{$key_hash1} is in > $hash_fasta2{$key_hash2} \n"; > print BOTH_IDS $key_hash1 . $key_hash2 . > $hash_fasta2{$key_hash2}; > } > } > } > > #close outputfile > close (BOTH_IDS) || die("Can't close in file: $!"); If you want to associate all sequences with their corresponding IDs then use a hash of arrays: #!/usr/local/bin/perl use warnings; use strict; print "Please type the name of the first file:\n"; chomp( @ARGV = scalar <STDIN> ); print "Please type the name of the second file:\n\n"; chomp( $ARGV[ 1 ] = <STDIN> ); my %hash_fasta; $/ = ' >'; while ( <> ) { my ( $id, $seq ) = split( /\n/, $_, 2 ) or next; $seq =~ tr/A-Z//cd; # remove non-sequence characters push @{ $hash_fasta{ $id } }, $seq; } for my $id ( keys %hash_fasta ) { print " >$id\n"; print "$_\n" for @{ $hash_fasta{ $id } }; } __END__ John -- use Perl; program fulfillment -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]