Ken Foskey wrote:

On Tue, 2008-03-18 at 00:55 +1100, Ken Foskey wrote:
I am extracting addresses from an XML file to process through other
programs using pipe delimiter the following code works but this is going
to get 130,000 records through it it must be very efficient and I cannot
follow the documentation on the best way to do this.

After this simple one is programmed I have to change a much more complex
version of this program.

#!/usr/bin/perl -w
# vi:set sw=4 ts=4 et cin:
# $Id:$

=head1 SYNOPSIS

Extract addresses from an XML file into pipe delimited file.

   usage: address_extract.pl  xml_file

=cut

use warnings;
use strict;

use XML::Twig qw(:strict);

sub no_pipe
{
    my $value = shift;

    $value =~ s/\|//g;
    return $value;
}

if( ! -f $ARGV[0] ) {
    print "$ARGV[0] is not a filename, requires filename as first
parameter!\n";
}

my $sort;
my $sort_file = $ARGV[0].'.unsorted';
unlink $sort_file; # in case of rerun
open( $sort, '>', $sort_file ) or die "Unable to open $sort_file for output $!";

my $ref = XML::Twig->new( twig_handlers=>{mem=>\&member} ) or die "Unable to open $ARGV[0] $!";

my $member = 0;

$ref->parsefile( $ARGV[0] );

sub get_value
{
    my ($mem_ref, $key) = @_;
    my @array = $mem_ref->descendants( $key );
    return $array[0]->text();
}

sub member {
    my ($twig, $mem_ref) = @_;
    $member++;

    my $mem_no = get_value( $mem_ref, 'member' );
    my $add1   = get_value( $mem_ref, 'add1' );
    my $add2   = get_value( $mem_ref, 'add2' );
    my $add3   = get_value( $mem_ref, 'add3' );
    my $suburb = get_value( $mem_ref, 'suburb' );
    my $state  = get_value( $mem_ref, 'state' );
    my $pcode  = get_value( $mem_ref, 'pcode' );

    print $sort join( '|', $member,
                     $mem_no,
                     no_pipe( $add1 ),
                     no_pipe( $add2 ),
                     no_pipe( $add3 ),
                     no_pipe( $suburb),
                     $state,
                     $pcode,
                    ) ."\n";
        $twig->purge;

    return 1;
}
>
> Updated to fix memory problem, you have to purge.   Takes over 30
> minutes for 120K records.
>
> I am sure that  the whole process can be done better with a good
> understanding of the module.  Will benchmark XML::Rules though.

Hi Ken

OK so your query was about how to make your code run faster. 15ms per
record isn't bad, but I'm sure it would be faster if you junk your
get_value and no_pipe functions and rewrite member something like
this:

sub member {

    my $mem_ref = pop;

    $member++;

    my @data = map $mem_ref->first_child_text($_),
        qw/member add1 add2 add3 suburb state pcode/;

    tr/|//d foreach @data;

    print $sort join('|', @data), "\n";
}

Oh, and there's no point in the die call on XML::Twig->new as something
dire is wrong if this returns false, and anyway it's certainly not
because the input file can't be opened. The parsefile method will throw
its own die if it can't open the file, so all bases are covered already.

HTH,

Rob


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/


Reply via email to