now, here's a homework question!! :)

On Sat, Apr 23, 2011 at 10:27 AM, galeb abu-ali <abuali...@gmail.com> wrote:
> Hi,
>
> I'm trying to parse a table containing information about genes in a
> bacterial chromosome. Below is a sample for one gene, and there's about 4500
> such blocks in a file:
>
> gene_oid   Locus Tag   Source            Cluster Information  Gene Information  E-value
> 642745051  SeSA_B0001  COG_category      [T] Signal transduction mechanisms
> 642745051  SeSA_B0001  COG_category      [K] Transcription
> 642745051  SeSA_B0001  COG1974           SOS-response transcriptional repressors (RecA-mediated autopeptidases)  2.0e-29
> 642745051  SeSA_B0001  pfam00717         Peptidase_S24  1.7e-13
> 642745051  SeSA_B0001  EC:3.4.21.-       Hydrolases. Acting on peptide bonds (peptide hydrolases). Serine endopeptidases.
> 642745051  SeSA_B0001  KO:K03503         DNA polymerase V [EC:3.4.21.-]  0.0e+00
> 642745051  SeSA_B0001  ITERM:03797       SOS response UmuD protein. Serine peptidase. MEROPS family S24
> 642745051  SeSA_B0001  Locus_type        CDS
> 642745051  SeSA_B0001  NCBI_accession    YP_002112883
> 642745051  SeSA_B0001  Product_name      protein SamA
> 642745051  SeSA_B0001  Scaffold          NC_011092
> 642745051  SeSA_B0001  Coordinates       34..459(+)
> 642745051  SeSA_B0001  DNA_length        426bp
> 642745051  SeSA_B0001  Protein_length    141aa
> 642745051  SeSA_B0001  GC                .52
>
> I want to parse information for Locus_Tag, Source, and Cluster Info for each
> gene so that the output table looks like this
>
> locus  COG_category  COG_category  COGID  Cluster_Information
>
> SeSA_B0001  [T] Signal transduction mechanisms  [K] Transcription  COG1974  SOS-response transcriptional repressors (RecA-mediated autopeptidases)
> SeSA_B0002  "\t"  [L] Replication, recombination and repair  COG0389  Nucleotidyltransferase/DNA polymerase involved in DNA repair
>
> My problem is that some genes have 2 entries for COG_category, some only one
> and others none. I took a look at perldsc and tried to fit the table into
> one of the complex structures but didn't get far.
> Below is the code I came up with so far:
>
> #!/usr/bin/perl
> # parse_IMG_gene_info.pl
> use strict; use warnings;

good, but no need to save space - you have a return key, put different things on different lines unless you *really* feel it looks / reads better to do otherwise.

> open( IN, "<", @ARGV ) or die "Failed to open: $!\n";

open( my $file, "<", $ARGV[ 0 ]) or die ".... $!\n";

> print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";
>
> my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source, $cluster_info,
> $e );
>
> while( <IN> ) {
>     if( $_=~ /COG_category/ ) {
>         ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
>         $cog_cat{ $locus } = $cluster_info;
>         push( @cogs, { %cog_cat } );
>     } elsif ( $_=~ /COG\d+/ ) {
>         ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
>         $cog_id{ $locus } = $cluster_info;
>     }
> }

i don't really have the knowledge to help here, nor do i really want to parse this by hand. instead, i'll suggest using Text::CSV_XS: it's much easier and will give you a good data structure. all you do to figure out whether a column is there is 'if( $csv->[ $col ] ) { ..column has data.. }'

> close IN;

close $file; or just let it go out of scope and close on its own.

> #print scalar @cogs, "\n";
>
> for my $test( sort keys %cog_cat ) {
>     print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n";
> }
> print "\n";

can i suggest a database? it isn't that hard and will help tons in future processing and manipulation of the data.

also, a quick google brought up some interesting results on your field:

http://oreilly.com/catalog/begperlbio/chapter/ch10.html
http://search.cpan.org/~mingyiliu/Bio-ASN1-EntrezGene-1.10-withoutworldwriteables/lib/Bio/ASN1/EntrezGene.pm

it might help to look at this (though i think that Text::CSV will suit your needs just fine):

http://oreilly.com/catalog/perlsysadm/chapter/ch09.html
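
for the "two, one or no COG_category" problem, here is a rough sketch of one way to hold the data: a hash keyed by locus, with an array of categories per gene. this is only an illustration under some assumptions, not a tested solution for the real file. it assumes the columns are tab-separated, it uses plain split instead of Text::CSV_XS so it runs with core perl only, and the names collect_cogs and format_row plus the sample rows are made up for the example (the rows are modelled on the block in the question).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect COG categories and COG ids per locus (hypothetical helper).
# Input: tab-separated lines of gene_oid, locus tag, source, cluster info.
# Returns a hashref: { locus => { cats => [...], cog_id => ..., info => ... } }
sub collect_cogs {
    my @lines = @_;
    my %genes;
    for my $line (@lines) {
        chomp $line;
        my ( $oid, $locus, $source, $info ) = split /\t/, $line;

        # skip blank lines and the header row (gene_oid is not numeric there)
        next unless defined $source && $oid =~ /^\d+$/;

        # remember every locus, even one with no COG rows at all
        $genes{$locus} ||= {};

        if ( $source eq 'COG_category' ) {
            # zero, one or two categories all fit in the same array
            push @{ $genes{$locus}{cats} }, $info;
        }
        elsif ( $source =~ /^COG\d+$/ ) {
            $genes{$locus}{cog_id} = $source;
            $genes{$locus}{info}   = $info;
        }
    }
    return \%genes;
}

# Format one output row (hypothetical helper): always two category
# columns, left empty when the gene has fewer than two categories.
sub format_row {
    my ( $locus, $gene ) = @_;
    my @cats = @{ $gene->{cats} || [] };
    push @cats, '' while @cats < 2;
    return join "\t", $locus, @cats[ 0, 1 ],
        $gene->{cog_id} // '', $gene->{info} // '';
}

# made-up sample rows modelled on the block in the question
my @sample = (
    "642745051\tSeSA_B0001\tCOG_category\t[T] Signal transduction mechanisms",
    "642745051\tSeSA_B0001\tCOG_category\t[K] Transcription",
    "642745051\tSeSA_B0001\tCOG1974\tSOS-response transcriptional repressors (RecA-mediated autopeptidases)\t2.0e-29",
    "642745051\tSeSA_B0001\tLocus_type\tCDS",
);

my $genes = collect_cogs(@sample);
print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n";
print format_row( $_, $genes->{$_} ), "\n" for sort keys %$genes;
```

for the real file you would read the lines from the $file handle opened above instead of @sample. note that a gene with a single category gets it in the first category column here, with the second left empty; swap the padding around if you want the blank first, as in your example output.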