On Sat, Apr 23, 2011 at 10:27:07AM -0400, galeb abu-ali wrote:
> Hi,
>
> I'm trying to parse a table containing information about genes in a
> bacterial chromosome. Below is a sample for one gene, and there's about 4500
> such blocks in a file:
<snip>
> My problem is that some genes have 2 entries for COG_category, some only one
> and others none. I took a look at perldsc and tried to fit the table into
> one of the complex structures but didn't get far. Below is the code I came
> up with so far:
<snip>
#!/usr/bin/perl
# parse_IMG_gene_info.pl
use strict; use warnings;
# '##' marks the lines of you code I changed/removed.
# One of these days you're going to want this program
# to report on more than one file.
# So wrap 'for my $file ( @ARGV ) { ... }' around it
# And open( my $IN, '<', $file or die "Failed to open $file: $!\n";
open( IN, "<", @ARGV ) or die "Failed to open: $!\n";
# Put this print statement down by the other. Keeping functional groups
# together aids understanding and it'll all be there when you realize
# the task has grown to the point you want an output routine.
print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";
# In Perl it's customary to declare your variables just before you
# need them and makes it easier to notice that %locus & $e aren't used
my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source,
$cluster_info, $e);
# Keeping functional groups together suggests:
# open...; while(...); close;
# with nothing else intervening. Again $IN.
# And while( my $line = <$IN> ) protects you from the many things
# that change $_.
# If you don't get out of the habit of using $_ it 'will' bite you.
# BTW, chomp, split, /COG_category/ and several other functions
# act on $_ by default so 'split "\t", $_' and 'split "\t"
# are equivalent. note also that 'split /\t/' is preferable.
while( <IN> ) {
chomp; # remove linefeeds from $cluster_info
if( $_=~ /COG_category/ ) {
## ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
# the tabs got lost in the email and it was easier for me to
# change the split than change the data file.
( $oid, $locus, $source, $cluster_info ) = split / +/, $_;
# When you found the 2nd cog_cat you overwrote the first
## $cog_cat{ $locus } = $cluster_info;
push @{ $cog_cat{ $locus } }, $cluster_info if($cluster_info);
# You never used this and it's an array of hashes when what I think
# you need is an array in your hash
## push( @cogs, { %cog_cat } );
} elsif ( $_=~ /COG\d+/ ) {
#( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
( $oid, $locus, $source, $cluster_info ) = split / +/, $_;
$cog_id{ $locus } = $cluster_info;
}
}
close IN;
#print scalar @cogs, "\n";
# Uncomment and take a look at the output of the next 2 lines
# and I think you'll see where you could use a single hash
# use Data::Dumper;
# print Dumper \%cog_cat, \%cog_id ;
for my $test( sort keys %cog_cat ) {
## print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n";
print $test, map {"\t$_"} @{$cog_cat{ $test }}, "\t$cog_id{ $test }\n";
}
print "\n";
__END__
This is the output I got running this against data from one of your
earlier posts. I think it's what you're looking for except for
an extra tab that I don't see where is coming from.
Challenge for the student? :)
locus COG_category COG_category COGID Cluster_Information
SeSA_B0001 [T] Signal transduction mechanisms [K] Transcription
SOS-response transcriptional repressors (RecA-mediated autopeptidases)
SeSA_B0002 [L] Replication, recombination and repair
Nucleotidyltransferase/DNA polymerase involved in DNA repair
HTH,
Mike
--
Satisfied user of Linux since 1997.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
--
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
http://learn.perl.org/