Counting gaps in sequence data revisited.

Michael S. Robeson II Sun, 17 Oct 2004 08:40:18 -0700

I just wanted to thank everyone for their help and suggestions. This is the final full working code to count continuos gaps in a every sequence in a multi-sequence FASTA file. It may not be elegant but it is fast and works well. Sorry for the long post but I wanted to share this with those that do any DNA work. :-)

I show in order: the format of the input data, the output of the input data, and finally the working script. I have yet to add comments into the code but I am sure many of you veterans will figure it out.

-Thanks again all! As always comments, questions or suggestions are welcome! -Mike


-------- INPUT --------
>dog
atcg--acgat---act-ca----
>cat
acgt-acgt----acgt-gt-agct-
>mouse
---acgtacg-atcg---actgac-


------- OUTPUT ---------
***Discovered the following DNA sequences:***
dog found!
cat found!
mouse found!
                >>>>>> mouse <<<<<<
Indel size:     1       Times found:    2
Positions:
11
25

Indel size:     3       Times found:    2
Positions:
1
16

                >>>>>> cat <<<<<<
Indel size:     1       Times found:    4
Positions:
5
18
21
26

Indel size:     4       Times found:    1
Positions:
10

                >>>>>> dog <<<<<<
Indel size:     1       Times found:    1
Positions:
18

Indel size:     2       Times found:    1
Positions:
5

Indel size:     3       Times found:    1
Positions:
12

Indel size:     4       Times found:    1
Positions:
21

------- Script ------- #!usr/bin/perl # By Michael S. Robeson II, with the help of friends at lernperl.org and bioperl.org! :-) # 10/16/2004

use warnings;
use strict;


###############################
# Open Sequence Data & OUTFILE
###############################

print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN>);

open(DNA_SEQ, $dna_seq)
        or die "Can't open file: $!\n";

open(OUTFILE, ">indel_list_"."$dna_seq")
        or die "Can't open outfile: $!\n";

############################
# Read sequence data into a hash
############################

my %sequences;

$/ = '>';

print "\n***Discovered the following DNA sequences:***\n";

while ( <DNA_SEQ> ) {
        chomp;
        next unless s/^\s*(.+)//;
        my $name = $1;
        s/\s//g;
        $sequences{$name} = $_;
        print "$name found!\n";
}
close DNA_SEQ;

######################################
# iterate over gaps and write to file
######################################

foreach (keys %sequences) {
        print "\t\t\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
        print OUTFILE "\>\>\>\>\>\> $_ \<\<\<\<\<\<\n";
        my $dna = $sequences{$_};
        my %gap_data;
        my %position;
        while ($dna =~ /(\-+)/g) {
                my $gap_pos = pos ($dna) - length($&) + 1;
                my $gap_length = length $1; #$1 =~ tr/\-+//
                $gap_data{$gap_length}++;
                $position{$gap_length} .= "$gap_pos"." \n";
        }
        
        my @indels = keys (%gap_data);
        my @keys = sort { $a <=> $b} @indels;
        
        foreach my $key (@keys) {
                print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
                print OUTFILE "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
                print "Positions:\n";
                print OUTFILE "Positions:\n";
                print "$position{$key}";
                print OUTFILE "$position{$key}";
                print "\n";
                print OUTFILE "\n";
                }
        # Can replace the last "foreach loop" above with the while loop
        # below to do the same thing. Only Gap sizes will not be sorted.
        # nor is it set up to print to a file
        #       
        # while (my ($key, $vlaue)  = each (%gap_data)) {
        #       print "Indel size:\t$key\tTimes found:\t$gap_data{$key}\n";
        #       print "Positions:\n";
        #       print "$position{$key}";
        #       print "\n\n";
        # }

}


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>

Counting gaps in sequence data revisited.

Reply via email to