Re: how to parse complex table

shawn wilson Sat, 23 Apr 2011 08:10:45 -0700

now, here's a homework question!! :)

On Sat, Apr 23, 2011 at 10:27 AM, galeb abu-ali <abuali...@gmail.com> wrote:
> Hi,
>
> I'm trying to parse a table containing information about genes in a
> bacterial chromosome. Below is a sample for one gene, and there's about 4500
> such blocks in a file:
>
> gene_oid    Locus Tag    Source    Cluster Information    Gene
> Information    E-value
> 642745051    SeSA_B0001    COG_category    [T] Signal transduction
> mechanisms
> 642745051    SeSA_B0001    COG_category    [K] Transcription
> 642745051    SeSA_B0001    COG1974    SOS-response transcriptional
> repressors (RecA-mediated autopeptidases)        2.0e-29
> 642745051    SeSA_B0001    pfam00717    Peptidase_S24        1.7e-13
> 642745051    SeSA_B0001    EC:3.4.21.-    Hydrolases. Acting on peptide
> bonds (peptide hydrolases). Serine endopeptidases.
> 642745051    SeSA_B0001    KO:K03503    DNA polymerase V [EC:3.4.21.-]
>    0.0e+00
> 642745051    SeSA_B0001    ITERM:03797    SOS response UmuD protein. Serine
> peptidase. MEROPS family S24
> 642745051    SeSA_B0001    Locus_type        CDS
> 642745051    SeSA_B0001    NCBI_accession        YP_002112883
> 642745051    SeSA_B0001    Product_name        protein SamA
> 642745051    SeSA_B0001    Scaffold        NC_011092
> 642745051    SeSA_B0001    Coordinates        34..459(+)
> 642745051    SeSA_B0001    DNA_length        426bp
> 642745051    SeSA_B0001    Protein_length        141aa
> 642745051    SeSA_B0001    GC        .52
>
>
>
>
> I want to parse information for Locus_Tag, Source, and Cluster Info for each
> gene so that the output table looks like this
>
>
> locus    COG_category    COG_category    COGID    Cluster_Information
>
> SeSA_B0001   [T] Signal transduction mechanisms    [K] Transcription
> COG1974    SOS-response transcriptional repressors (RecA-mediated
> autopeptidases)
> SeSA_B0002    "\t" [L] Replication, recombination and repair    COG0389
> Nucleotidyltransferase/DNA polymerase involved in DNA repair
>
>
> My problem is that some genes have 2 entries for COG_category, some only one
> and others none. I took a look at perldsc and tried to fit the table into
> one of the complex structures but didn't get far. Below is the code I came
> up with so far:
>
> #!/usr/bin/perl
> # parse_IMG_gene_info.pl
> use strict; use warnings;


good, but no need to save space - you have a return key, so put
different things on different lines unless you *really* feel it looks /
reads better to do otherwise.

>
>
> open( IN, "<", @ARGV ) or die "Failed to open: $!\n";

open( my $file, "<", $ARGV[ 0 ]) or die ".... $!\n";

>
> print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";
>
> my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source, $cluster_info,
> $e );
>
> while( <IN> ) {
>    if( $_=~ /COG_category/ ) {
>        ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
>        $cog_cat{ $locus } =  $cluster_info;
>        push( @cogs, { %cog_cat } );
>    } elsif ( $_=~ /COG\d+/ ) {
>        ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
>        $cog_id{ $locus } =  $cluster_info;
>    }
> }
>

i don't really have the knowledge to help here, nor really want to
parse this. instead, i'll suggest using Text::CSV_XS (set sep_char to
"\t" for a tab-separated file). it's much easier and will give you a
good data structure: each row comes back as an array ref, so all you
do to figure out whether a column is there is
'if( $row->[ $col ] ) { ..column has data.. }'
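for what it's worth, here is a minimal sketch of the data structure side, assuming the file really is tab-separated with the columns in the order shown in the post. it uses plain split rather than Text::CSV_XS so that it runs with core perl only; with Text::CSV_XS each getline call should hand back the same kind of array ref. the sample rows (including the second gene_oid), and the names parse_genes, format_row, cats, cog_id and cog_desc are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample rows, tab-separated, in the column order from the post:
# gene_oid, Locus Tag, Source, Cluster Information, Gene Information, E-value
my $sample = join "\n",
    join( "\t", '642745051', 'SeSA_B0001', 'COG_category', '[T] Signal transduction mechanisms' ),
    join( "\t", '642745051', 'SeSA_B0001', 'COG_category', '[K] Transcription' ),
    join( "\t", '642745051', 'SeSA_B0001', 'COG1974', 'SOS-response transcriptional repressors', '', '2.0e-29' ),
    join( "\t", '642745052', 'SeSA_B0002', 'COG_category', '[L] Replication, recombination and repair' ),
    join( "\t", '642745052', 'SeSA_B0002', 'COG0389', 'Nucleotidyltransferase/DNA polymerase' ),
    join( "\t", '642745052', 'SeSA_B0002', 'Locus_type', '', 'CDS' );

# Build one record per locus: a list of COG categories (zero, one or two)
# plus the COG id and its description. Also remember first-seen order.
sub parse_genes {
    my ($fh) = @_;
    my ( %gene, @order );
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $oid, $locus, $source, $cluster_info ) = split /\t/, $line;

        # Skip blank lines and the header (its first field is not numeric).
        next unless defined($source) && $oid =~ /^\d+$/;
        $cluster_info = '' unless defined $cluster_info;

        if ( !exists $gene{$locus} ) {
            $gene{$locus} = { cats => [], cog_id => '', cog_desc => '' };
            push @order, $locus;
        }
        if ( $source eq 'COG_category' ) {
            push @{ $gene{$locus}{cats} }, $cluster_info;
        }
        elsif ( $source =~ /^COG\d+$/ ) {
            $gene{$locus}{cog_id}   = $source;
            $gene{$locus}{cog_desc} = $cluster_info;
        }
    }
    return ( \%gene, \@order );
}

# Pad the category list to two entries so COGID always lands in column 4.
sub format_row {
    my ( $locus, $rec ) = @_;
    my @cats = @{ $rec->{cats} };
    push @cats, '' while @cats < 2;
    return join "\t", $locus, @cats[ 0, 1 ], $rec->{cog_id}, $rec->{cog_desc};
}

# Read the sample through an in-memory filehandle; a real script would
# open $ARGV[0] instead.
open( my $fh, '<', \$sample ) or die "Failed to open in-memory sample: $!\n";
my ( $gene, $order ) = parse_genes($fh);
close $fh;

print join( "\t", qw( locus COG_category COG_category COGID Cluster_Information ) ), "\n";
print format_row( $_, $gene->{$_} ), "\n" for @$order;
```

keeping the categories in an array per locus is what takes care of the "two, one or none" problem: the padding happens only at print time, so the stored data stays honest about how many categories each gene has.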

> close IN;

close $file;
or just let it go out of scope and close on its own.
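a small sketch of what "go out of scope" means here: a lexical filehandle declared inside a sub or block is closed by perl once the last reference to it goes away. count_lines and the temp file are hypothetical, just there to make the example self-contained:

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical helper: count the lines in a file.
sub count_lines {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Failed to open '$path': $!\n";
    my $count = 0;
    while ( my $line = <$fh> ) {
        $count++;
    }
    return $count;    # $fh goes out of scope here, so perl closes it
}

# Write a two-line temp file to read back. For a handle you write to,
# an explicit close is still worth it so that write errors get reported.
my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "line one\nline two\n";
close $tmp_fh or die "Failed to close '$tmp_path': $!\n";

print count_lines($tmp_path), "\n";
```

this only works with lexical handles like $file or $fh; a bareword handle such as IN is a package global and stays open until it is closed or reopened.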

>
> #print scalar @cogs, "\n";
>
> for my $test( sort keys %cog_cat ) {
>    print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n";
> }
> print "\n";

can i suggest a database? it isn't that hard and will help tons in
future processing of the data and manipulation. also, a quick google
brought up some interesting results on your field:

http://oreilly.com/catalog/begperlbio/chapter/ch10.html
http://search.cpan.org/~mingyiliu/Bio-ASN1-EntrezGene-1.10-withoutworldwriteables/lib/Bio/ASN1/EntrezGene.pm

it might help to look at this (though i think that Text::CSV will
suit your needs just fine):
http://oreilly.com/catalog/perlsysadm/chapter/ch09.html

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
