Re: Parsing file

venkates Thu, 02 Jun 2011 07:41:58 -0700

On 6/2/2011 2:44 PM, Rob Coops wrote:

On Thu, Jun 2, 2011 at 1:28 PM, venkates<venka...@nt.ntnu.no>  wrote:

On 6/2/2011 12:46 PM, John SJ Anderson wrote:

On Thu, Jun 2, 2011 at 06:41, venkates<venka...@nt.ntnu.no>   wrote:

Hi,

I want to parse a file with contents that looks as follows:

[ snip ]

Have you considered using this module? ->
<http://search.cpan.org/dist/BioPerl/Bio/SeqIO/kegg.pm>

Alternatively, I think somebody on the BioPerl mailing list was
working on another KEGG parser...

chrs,
j.

I am doing this as an exercise to learn parsing techniques, so guidance and
help are needed.

Aravind



--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/

This is a simple and ugly way of parsing your file:

use strict;
use warnings;
use Carp;
use Data::Dumper;

my $set = parse("ko");

sub parse {
  my $keggFile = shift;
  my $keggHash;

  my $counter = 1;

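  # Use low-precedence "or": with "||" the croak is attached to $keggFile,
  # so a failed open would never be reported.
  open my $fh, '<', $keggFile or croak("Cannot open file '$keggFile': $!");
  while (<$fh>) {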
   chomp;
   if ( $_ =~ m!///! ) {
    $counter++;
    next;
   }

   if ( $_ =~ /^ENTRY\s+(.+?)\s/sm ) { ${$keggHash}{$counter} = { 'ENTRY' => $1 }; }

I tried a similar thing for the DEFINITION record, but instead of being added to the current hash alongside ENTRY and NAME, the DEFINITION record replaces the contents of the hash. Why?


$VAR1 = {
          '4' => {
                   'DEFINITION' => 'U18 small nucleolar RNA'
                 },
          '1' => {
                   'DEFINITION' => 'alcohol dehydrogenase [EC:1.1.1.1]'
                 },
          '3' => {
                   'DEFINITION' => 'U14 small nucleolar RNA'
                 },
          '2' => {
                   'DEFINITION' => 'alcohol dehydrogenase (NADP+)[EC:1.1.1.2]'
                 },
          '5' => {
                   'DEFINITION' => 'U24 small nucleolar RNA'
                 }
        };

Code, in addition to what you had suggested:

   if ( $_ =~ /^DEFINITION\s{2}(.+)?/ ) {
    ${$keggHash}{$counter} = { 'DEFINITION' => $1 };
   }
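
A likely cause, offered as a sketch rather than a definitive answer: `${$keggHash}{$counter} = { ... }` assigns a brand-new anonymous hash to the record, so any keys stored earlier for that counter are discarded. Assigning to a single key inside the existing hash should keep ENTRY and NAME. The sample lines below are made up for illustration, modelled on the values shown in this thread, and the loop is simplified from the code above.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample record, modelled on the values quoted in this thread.
my @lines = (
    'ENTRY       K00001                      KO',
    'NAME        E1.1.1.1, adh',
    'DEFINITION  alcohol dehydrogenase [EC:1.1.1.1]',
    '///',
);

my $keggHash = {};
my $counter  = 1;

for my $line (@lines) {
    if ( $line =~ m{^///} ) {
        # End-of-record marker: move on to the next record number.
        $counter++;
        next;
    }
    if ( $line =~ /^ENTRY\s+(\S+)/ ) {
        # Store under one key; the inner hash is created on first use.
        $keggHash->{$counter}{ENTRY} = $1;
    }
    if ( $line =~ /^NAME\s+(.*)$/ ) {
        push @{ $keggHash->{$counter}{NAME} }, split /,\s*/, $1;
    }
    if ( $line =~ /^DEFINITION\s+(.+)$/ ) {
        # Assign to the DEFINITION key only. Writing
        # "= { DEFINITION => $1 }" here would replace the whole record.
        $keggHash->{$counter}{DEFINITION} = $1;
    }
}

print Dumper($keggHash);
```

With this change each record should carry ENTRY, NAME and DEFINITION side by side instead of DEFINITION alone.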

   if ( $_ =~ /^NAME\s+(.*)$/sm ) {
    my $temp = $1;
    $temp =~ s/,\s/,/g;
    my @names = split /,/, $temp;
    push @{${$keggHash}{$counter}{'NAME'}}, @names;
   }
  }
  close $fh;
  print Dumper $keggHash;
}

The output being:

$VAR1 = {
           '1' =>  {
                    'NAME' =>  [
                                'E1.1.1.1',
                                'adh'
                              ],
                    'ENTRY' =>  'K00001'
                  },
           '3' =>  {
                    'NAME' =>  [
                                'U18snoRNA',
                                'snR18'
                              ],
                    'ENTRY' =>  'K14866'
                  },
           '2' =>  {
                    'NAME' =>  [
                                'U14snoRNA',
                                'snR128'
                              ],
                    'ENTRY' =>  'K14865'
                  }
         };

Which to me looks sort of like what you are looking for.
The main thing I did was read the file one line at a time to prevent an
unexpectedly large file from causing memory issues on your machine (in the
end, the structure that you are building will cause enough issues of its own
when handling a large file).
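
To make the line-at-a-time point concrete, here is a minimal sketch. It assumes a tiny in-memory string standing in for the real "ko" file, so the data and variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical in-memory data standing in for the "ko" file.
my $data = "ENTRY       K00001\n///\nENTRY       K00002\n///\n";

# Opening a reference to a scalar gives a read-only in-memory filehandle.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $records = 0;

# Scalar-context readline: only the current line is held in $line.
while ( my $line = <$fh> ) {
    chomp $line;
    $records++ if $line =~ m{^///};
}
close $fh;

print "records seen: $records\n";
```

By contrast, `my @all = <$fh>;` reads every line into memory at once, which is the pattern the loop avoids.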

You already dealt with the Entry bit so I'll leave that open though I
slightly changed the regex but nothing spectacular there.
The Name bit is simple: I pull out the whole list, remove the whitespace after
each comma, split it into an array, and push the array into the hash. And hop,
time for the next step, which is up to you ;-)
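
As a possible shortcut for the Name step, the substitute-then-split pair can be folded into a single `split` whose pattern consumes the comma and any whitespace after it. The helper name `split_names` is made up for this sketch, and the sample line reuses values from the output above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the list of names from a NAME line,
# or an empty list if the line is not a NAME line.
sub split_names {
    my ($line) = @_;
    return unless $line =~ /^NAME\s+(.*)$/;

    # Split on a comma plus any whitespace following it, in one step.
    return split /,\s*/, $1;
}

my @names = split_names('NAME        U14snoRNA, snR128');
print join( '|', @names ), "\n";
```

The result can be pushed onto `@{ $keggHash->{$counter}{NAME} }` exactly as in the code above.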

I hope it helps you a bit, regards,

Rob


