Re: how to parse complex table

Mike McClain Sat, 23 Apr 2011 20:54:35 -0700

On Sat, Apr 23, 2011 at 10:27:07AM -0400, galeb abu-ali wrote:
> Hi,
> 
> I'm trying to parse a table containing information about genes in a
> bacterial chromosome. Below is a sample for one gene, and there's about 4500
> such blocks in a file:
<snip> 
> My problem is that some genes have 2 entries for COG_category, some only one
> and others none. I took a look at perldsc and tried to fit the table into
> one of the complex structures but didn't get far. Below is the code I came
> up with so far:
<snip>


#!/usr/bin/perl
# parse_IMG_gene_info.pl
use strict; use warnings;

#   '##' marks the lines of you code I changed/removed.

#   One of these days you're going to want this program 
#       to report on more than one file.
#   So wrap 'for my $file ( @ARGV ) { ... }' around it
#   And open( my $IN, '<', $file   or die "Failed to open $file: $!\n";

open( IN, "<", @ARGV ) or die "Failed to open: $!\n";

#   Put this print statement down by the other. Keeping functional groups
#   together aids understanding and it'll all be there when you realize
#   the task has grown to the point you want an output routine.

print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";

#   In Perl it's customary to declare your variables just before you 
#   need them and makes it easier to notice that %locus & $e aren't used
my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source, 
        $cluster_info, $e);

#   Keeping functional groups together suggests: 
#       open...; while(...); close;
#   with nothing else intervening.  Again $IN.
#   And while( my $line = <$IN> ) protects you from the many things
#       that change $_. 
#   If you don't get out of the habit of using $_ it 'will' bite you.
#   BTW, chomp, split, /COG_category/ and several other functions
#       act on $_ by default so 'split "\t", $_' and 'split "\t"
#       are equivalent.     note also that 'split /\t/' is preferable.
while( <IN> ) { 
    chomp;                          #   remove linefeeds from $cluster_info
    if( $_=~ /COG_category/ ) {     
##        ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;

#   the tabs got lost in the email and it was easier for me to
#       change the split than change the data file.
        ( $oid, $locus, $source, $cluster_info ) = split /  +/, $_;
#   When you found the 2nd cog_cat you overwrote the first
##        $cog_cat{ $locus } =  $cluster_info;
        push @{ $cog_cat{ $locus } }, $cluster_info     if($cluster_info);
#   You never used this and it's an array of hashes when what I think 
#       you need is an array in your hash
##        push( @cogs, { %cog_cat } );
    } elsif ( $_=~ /COG\d+/ ) {
        #( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
        ( $oid, $locus, $source, $cluster_info ) = split /  +/, $_;
        $cog_id{ $locus } =  $cluster_info;
    }
}
close IN;

#print scalar @cogs, "\n";

#   Uncomment and take a look at the output of the next 2 lines
#   and I think you'll see where you could use a single hash
# use Data::Dumper;
# print Dumper \%cog_cat, \%cog_id ;

for my $test( sort keys %cog_cat ) {
##    print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n";
    print $test, map {"\t$_"} @{$cog_cat{ $test }}, "\t$cog_id{ $test }\n";
}
print "\n";
__END__
This is the output I got running this against data from one of your 
earlier posts. I think it's what you're looking for except for
an extra tab that I don't see where is coming from.
Challenge for the student? :)

locus   COG_category    COG_category    COGID   Cluster_Information

SeSA_B0001      [T] Signal transduction mechanisms      [K] Transcription      
SOS-response transcriptional repressors (RecA-mediated autopeptidases)
SeSA_B0002      [L] Replication, recombination and repair               
Nucleotidyltransferase/DNA polymerase involved in DNA repair

HTH, 
Mike
-- 
Satisfied user of Linux since 1997.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

-- 
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
http://learn.perl.org/

Re: how to parse complex table

Reply via email to