On Mon, Mar 17, 2008 at 9:55 AM, Ken Foskey <[EMAIL PROTECTED]> wrote:
>
>  I am extracting addresses from an XML file to process through other
>  programs using pipe delimiter the following code works but this is going
>  to get 130,000 records through it it must be very efficient and I cannot
>  follow the documentation on the best way to do this.
>
>  After this simple one is programmed I have to change a much more complex
>  version of this program.
>
>  #!/usr/bin/perl -w
>  # vi:set sw=4 ts=4 et cin:
>  # $Id:$
>
>  =head1 SYNOPSIS
>
>  Extract addresses from an XML file into pipe delimited file.
>
>    usage: address_extract.pl  xml_file
>
>  =cut
>
>  use warnings;
>  use strict;
>
>  use XML::Twig qw(:strict);
>
>  sub no_pipe
>  {
>     my $value = shift;
>
>     $value =~ s/\|//g;
>     return $value;
>  }
>
>  if( ! -f $ARGV[0] ) {
>     print "$ARGV[0] is not a filename, requires filename as first
>  parameter!\n";
>  }
>
>  my $sort;
>  my $sort_file = $ARGV[0].'.unsorted';
>  unlink $sort_file; # in case of rerun
>  open( $sort, '>', $sort_file  )
>     or die "Unable to open $sort_file for output $!";
>
>  my $ref = XML::Twig->new( twig_handlers=>{mem=>\&member} )
>     or die "Unable to open $ARGV[0] $!";
>
>  my $member = 0;
>
>  $ref->parsefile( $ARGV[0] );
>
>  sub get_value
>  {
>     my ($mem_ref, $key) = @_;
>     my @array = $mem_ref->descendants( $key );
>     return $array[0]->text();
>  }
>
>  sub member {
>     my ($t, $mem_ref) = @_;
>     $member++;
>
>     my $mem_no = get_value( $mem_ref, 'member' );
>     my $add1   = get_value( $mem_ref, 'add1' );
>     my $add2   = get_value( $mem_ref, 'add2' );
>     my $add3   = get_value( $mem_ref, 'add3' );
>     my $suburb = get_value( $mem_ref, 'suburb' );
>     my $state  = get_value( $mem_ref, 'state' );
>     my $pcode  = get_value( $mem_ref, 'pcode' );
>
>     print $sort join( '|', $member,
>                      $mem_no,
>                      no_pipe( $add1 ),
>                      no_pipe( $add2 ),
>                      no_pipe( $add3 ),
>                      no_pipe( $suburb),
>                      $state,
>                      $pcode,
>                     ) ."\n";
>     return 1;
>  }
>
>

Ken,

If you're really worried about performance, the first two places I
would look are the temporary variables and the subroutine
invocations. I don't know the ins and outs of XML::Twig, so I don't
have any concrete advice. For instance, does
'$mem_ref->descendants( $key )->text()' work? Some modules will parse
a structure like that, some won't. In general, though, think about
creative ways you might use map to avoid three subroutine invocations
and (5? 6?) temporary variables for each element you process.
Something like the following should get you started down the path:

## Untested!

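sub member {
    my ( $t, $mem_ref ) = @_;

    # descendants() returns a list, so take the first match, as get_value did.
    # s/// returns a count, not the string, so the block must return $v itself.
    print $sort join( "|", ++$member,
        map { ( my $v = $_ ) =~ s/\|//g; $v }
        map { ( $mem_ref->descendants( $_ ) )[0]->text() }
        qw/ member add1 add2 add3 suburb state pcode / ), "\n";
    return 1;
}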

That may not work out of the box depending on how deeply nested
XML::Twig's refs are, but hopefully you can see where I'm headed with
it.
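
One thing to watch with the map idea: s/// (like tr///) returns a
count rather than the modified string, so the map block has to hand
back the value explicitly. Here is a minimal sketch of the pattern on a
plain hash. The helper name pipe_record and the sample data are made
up for illustration; with XML::Twig the hash lookup would be replaced
by whatever call fetches the element text.

```perl
use strict;
use warnings;

# Hypothetical helper: build one pipe-delimited record from a hash of fields.
sub pipe_record {
    my ( $rec, @keys ) = @_;
    return join '|', map {
        my $v = defined $rec->{$_} ? $rec->{$_} : '';   # missing field becomes ''
        $v =~ tr/|//d;                                  # delete embedded pipes from the copy
        $v;                                             # return the string, not the count
    } @keys;
}

# Made-up sample record; add2 is deliberately missing.
my %rec = ( member => '42', add1 => '1 Main| St', suburb => 'Sydney' );
print pipe_record( \%rec, qw/ member add1 add2 suburb / ), "\n";
# prints: 42|1 Main St||Sydney
```

Working on a copy ($v) also avoids modifying the caller's data, since
map aliases $_ to the original elements.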

Also, string concatenation is less efficient than adding another term
to print's argument list, i.e. use ',' instead of '.' in print when
you can. Lastly, efficiency for matching is, IME, largely a matter
of system configuration and specific input data, but it's worth
benchmarking to see if substr() performs better than s/// in your
case. Sometimes you can get significant savings that way. Sometimes
you can't, of course.
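
To measure this rather than guess, the core Benchmark module can
compare candidates directly. The sketch below times s///g against
tr///d for deleting pipes. Note that tr///d is a substitute I picked
for the illustration, not the substr() approach mentioned above, and
the sample string is made up, so real address data may give different
results.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample value; real address data may behave differently.
my $sample = '12 Example|Street, Unit|4';

sub strip_subst { my $v = shift; $v =~ s/\|//g; return $v }
sub strip_tr    { my $v = shift; $v =~ tr/|//d; return $v }

# Sanity check: both approaches must agree before timing them.
die "results differ" unless strip_subst($sample) eq strip_tr($sample);

# A negative count means "run each candidate for at least that many CPU seconds".
cmpthese( -1, {
    's///g'  => sub { strip_subst($sample) },
    'tr///d' => sub { strip_tr($sample) },
} );
```

cmpthese prints a table of rates and relative percentages, which
makes it easy to see whether the difference matters at 130,000
records.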

HTH,

-- jay
--------------------------------------------------
This email and attachment(s): [ ] blogable; [ x ] ask first; [ ]
private and confidential

daggerquill [at] gmail [dot] com
http://www.tuaw.com http://www.downloadsquad.com http://www.engatiki.org

values of β will give rise to dom!
