On 09/06/2011 09:48, venkates wrote:
> Hi,
> 
> data snippet:
> 
> ENTRY       K00002                      KO
> NAME        E1.1.1.2, adh
> DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
> PATHWAY     ko00010  Glycolysis / Gluconeogenesis
>             ko00561  Glycerolipid metabolism
>             ko00930  Caprolactam degradation
> CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis [PATH:ko00010]
>             Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
>             Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam degradation [PATH:ko00930]
> DBLINKS     RN: R00746 R01041 R05231
>             COG: COG0656
>             GO: 0008106
> GENES       HSA: 10327(AKR1A1)
>             PTR: 741418(AKR1A1)
>             PON: 100173796(AKR1A1)
>             MCC: 693380(AKR1A1)
>             MMU: 58810(Akr1a4)
>             RNO: 78959(Akr1a1)
>             CFA: 610537
> ///
> ENTRY       K00730                      KO
> NAME        OST4
> DEFINITION  oligosaccharyl transferase complex subunit OST4
> PATHWAY     ko00510  N-Glycan biosynthesis
>             ko00513  Various types of N-glycan biosynthesis
>             ko04141  Protein processing in endoplasmic reticulum
> MODULE      M00072  Oligosaccharyltransferase
> CLASS       Metabolism; Glycan Biosynthesis and Metabolism; N-Glycan biosynthesis [PATH:ko00510]
>             Metabolism; Glycan Biosynthesis and Metabolism; Various types of N-glycan biosynthesis [PATH:ko00513]
>             Genetic Information Processing; Folding, Sorting and Degradation; Protein processing in endoplasmic reticulum [PATH:ko04141]
> DBLINKS     GO: 0008250
> GENES       SCE: YDL232W(OST4)
>             AGO: AGOS_ABL170C
>             KLA: KLLA0A01287g
>             VPO: Kpol_1054p35
>             SSL: SS1G_13465
> REFERENCE   PMID:15001703
>   AUTHORS   Zubkov S, Lennarz WJ, Mohanty S
>   TITLE     Structural basis for the function of a minimembrane protein subunit of yeast oligosaccharyltransferase.
>   JOURNAL   Proc Natl Acad Sci U S A 101:3821-6 (2004)
> ///
> 
> I need to retrieve all the gene entries and add them to a hash ref. My code 
> does that for the first record, but for the second it also pulls in 
> the REFERENCE information. I have provided the code below. If someone 
> could tell me where exactly I am going wrong (is it in the regex, or 
> elsewhere?), I would be glad!
> 
> code :
> 
> use strict;
> use warnings;
> use Carp;
> use Data::Dumper;
> 
> 
> my $set = parse("/home/venkates/workspace/KEGG_Parser/data/ko");
> 
> sub parse {
> 
>     my $kegg_file_path = shift;
>     my $keggData; # Hash ref
> 
>     open my $fh, '<', $kegg_file_path
>         or croak("Cannot open file '$kegg_file_path': $!");
>     local $/ = "\n///\n";
>     while (<$fh>){
>         chomp;
>         my $record = $_;
>         $record =~ m/^ENTRY\s{7}(.+?)\s+/xms;
>         my $entries = $1;
>         if ($record =~ m/^GENES\s{7}(.+)$/xms){
>             my $gene = $1;
>             ${$keggData}{$entries}{'GENE'} = $gene;
>             my @genes = split ('\s{13}', $gene);
>             foreach my $gene_element (@genes){
>                 my $taxon_label = substr($gene_element, 0, 3);
>                 my $gene_label = substr($gene_element, 5);
>                 my @gene_label_array = split '\s', $gene_label;
>                 push @{${$keggData}{$entries}{'GENES'}{$taxon_label}}, @gene_label_array;
>             }
>         }
> 
>     }
>     print Dumper($keggData);
>     close $fh;
> }
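
On the regex question: as far as I can tell, the trouble is the /s flag
combined with the greedy (.+). With /s, '.' also matches newlines, so
(.+)$ runs to the end of the record rather than the end of the GENES
block. The first record only works because GENES happens to be its last
field. Below is a minimal sketch of one way to stop at the next field,
assuming every field name starts in column 1. The name genes_from_record
is a hypothetical helper, not something from the original code, and the
sample record is trimmed from the data above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): extract the GENES
# block from one KEGG record and return a hash ref of
# taxon => [gene labels].
sub genes_from_record {
    my ($record) = @_;
    my %genes;

    # Lazy match plus a lookahead: stop before the next line that starts
    # with a non-space character (the next field), or at end of record.
    if ($record =~ m/^GENES\s+(.+?)(?=^\S|\z)/ms) {
        for my $line (split /\n/, $1) {
            # split ' ' ignores the leading indentation of continuation lines
            my ($taxon, @labels) = split ' ', $line;
            next unless defined $taxon;
            $taxon =~ s/:\z//;    # "SCE:" becomes "SCE"
            push @{ $genes{$taxon} }, @labels;
        }
    }
    return \%genes;
}

# Trimmed sample record; REFERENCE follows GENES, as in the second entry
my $record = <<'END';
ENTRY       K00730                      KO
GENES       SCE: YDL232W(OST4)
            AGO: AGOS_ABL170C
REFERENCE   PMID:15001703
  AUTHORS   Zubkov S, Lennarz WJ, Mohanty S
END

my $genes = genes_from_record($record);
print join(', ', map { "$_ => @{ $genes->{$_} }" } sort keys %$genes), "\n";
```

Only the SCE and AGO lines should end up in the hash; the REFERENCE
lines are left out because the lookahead stops the match first.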

I would prefer to read the file a line at a time. The code below seems 
to do what you want.

HTH,

Rob


use strict;
use warnings;

use Data::Dumper;

my $kegg_file = '/home/venkates/workspace/KEGG_Parser/data/ko';

my $fh;
# Three-argument open, read-only; fall back to the embedded sample data
unless (open $fh, '<', $kegg_file) {
  warn "Failed to open file: $!. Defaulting to DATA.";
  $fh = \*DATA;
}

parse($fh);

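sub parse {

  my $kegg_file_handle = shift;
  my $keggData;   # hash ref: entry => GENES => taxon => [gene labels]

  my $entry;      # ID of the record being read
  my $key;        # most recent field name; continuation lines inherit it

  # Read from the handle that was passed in, not the file-scoped $fh
  while (<$kegg_file_handle>) {

    next unless /\S/;

    # '///' ends a record, so forget the current entry and field
    if (m|^///|) {
       undef $entry;
       undef $key;
       next;
    }

    chomp;

    # The first 12 columns hold the field name (blank on continuation
    # lines); the rest of the line is the value
    next unless m|^(.{0,11}?)\s+(.+)|;

    $key = $1 if $1;
    my $val = $2;
    next unless defined $key;

    if ($key eq 'ENTRY') {
      ($entry) = $val =~ /(\S+)/;
    }
    elsif ($key eq 'GENES') {
      die "No current entry" unless $entry;
      # "HSA: 10327(AKR1A1)" splits into the taxon and its gene labels
      my ($taxon_label, @gene_label_array) = split /:?\s+/, $val;
      push @{$keggData->{$entry}{$key}{$taxon_label}}, @gene_label_array;
    }
  }

  print Dumper($keggData);
}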

__DATA__
ENTRY       K00002                      KO
NAME        E1.1.1.2, adh
DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
PATHWAY     ko00010  Glycolysis / Gluconeogenesis
            ko00561  Glycerolipid metabolism
            ko00930  Caprolactam degradation
CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis [PATH:ko00010]
            Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
            Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam degradation [PATH:ko00930]
DBLINKS     RN: R00746 R01041 R05231
            COG: COG0656
            GO: 0008106
GENES       HSA: 10327(AKR1A1)
            PTR: 741418(AKR1A1)
            PON: 100173796(AKR1A1)
            MCC: 693380(AKR1A1)
            MMU: 58810(Akr1a4)
            RNO: 78959(Akr1a1)
            CFA: 610537
///
ENTRY       K00730                      KO
NAME        OST4
DEFINITION  oligosaccharyl transferase complex subunit OST4
PATHWAY     ko00510  N-Glycan biosynthesis
            ko00513  Various types of N-glycan biosynthesis
            ko04141  Protein processing in endoplasmic reticulum
MODULE      M00072  Oligosaccharyltransferase
CLASS       Metabolism; Glycan Biosynthesis and Metabolism; N-Glycan biosynthesis [PATH:ko00510]
            Metabolism; Glycan Biosynthesis and Metabolism; Various types of N-glycan biosynthesis [PATH:ko00513]
            Genetic Information Processing; Folding, Sorting and Degradation; Protein processing in endoplasmic reticulum [PATH:ko04141]
DBLINKS     GO: 0008250
GENES       SCE: YDL232W(OST4)
            AGO: AGOS_ABL170C
            KLA: KLLA0A01287g
            VPO: Kpol_1054p35
            SSL: SS1G_13465
REFERENCE   PMID:15001703
  AUTHORS   Zubkov S, Lennarz WJ, Mohanty S
  TITLE     Structural basis for the function of a minimembrane protein subunit of yeast oligosaccharyltransferase.
  JOURNAL   Proc Natl Acad Sci U S A 101:3821-6 (2004)
///
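
The parse sub above builds a structure shaped like
entry => GENES => taxon => list of gene labels. Here is a small sketch
of walking that structure once you have it. It uses a hand-built hash
with a few values copied from the sample data rather than the parser's
real output, and gene_lines is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hand-built sample in the same shape the parser stores its results;
# only a few genes from the sample data are included.
my $keggData = {
    K00002 => { GENES => { HSA => ['10327(AKR1A1)'], MMU => ['58810(Akr1a4)'] } },
    K00730 => { GENES => { SCE => ['YDL232W(OST4)'] } },
};

# Hypothetical helper: flatten the nested hash into tab-separated lines
# of entry, taxon, and gene labels, sorted by entry and then taxon.
sub gene_lines {
    my ($data) = @_;
    my @lines;
    for my $entry (sort keys %$data) {
        # Skip entries that have no GENES field at all
        my $by_taxon = $data->{$entry}{GENES} or next;
        for my $taxon (sort keys %$by_taxon) {
            push @lines, join "\t", $entry, $taxon, @{ $by_taxon->{$taxon} };
        }
    }
    return @lines;
}

print "$_\n" for gene_lines($keggData);
```

To use it with real data, parse would need to return $keggData instead
of only printing it with Dumper.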
