Hi Peter,

Your code looks great and works perfectly, apart from some minor problems which might be due to the XML file itself (I think). However, comparing your code with mine, there are a few things I'd like to ask you, if you don't mind.

1) What is the main difference in memory load between setting handlers and not setting handlers before calling $parser->parsefile($xml)?

Does it mean that your code actually accesses the XML file piece by piece, with the first handler dealing only with <Topic/> and the second handler only with <ExternalPage/>? If so, does setting handlers actually change the way the file is loaded?

2) My understanding of your code is that you first look at the <Topic/> nodes and check whether they have <link/> children; if they do, you save them into a hash table for later comparison with <ExternalPage/>. But my question is, how do you go through all <Topic/> and all <ExternalPage/> nodes one by one by calling the subroutine just once, without using any kind of loop? And how do you link these 2 handlers together?

3) My original intention is, for each <Topic/> with valid <link/> children, to open a file in a directory named exactly as found in Topic->att('about'), then write all the link information found in <ExternalPage/>, then close the file. However, after reading your code again and again, I don't know where I should close the file handle, because sub _links_handler is used for finding links one by one and I don't know when an <ExternalPage/> has finished being parsed.

Is there any suggestion about this?

Sorry, I'm really new to Perl XML processing...

Many many thanks again,

Nan

From: Peter Rabbitson <[EMAIL PROTECTED]>
To: beginners@perl.org
Subject: Re: Errors on processing 2GB XML file by using XML:Simple
Date: Mon, 16 May 2005 09:31:15 -0500

On Mon, May 16, 2005 at 01:33:15PM +0000, Nan Jiang wrote:
> While I think <Topic/> and <ExternalPage/> are not randomly intermixed, as
> <Topic/> nodes are generated in relevant categories such as <Arts/> ->
> <Arts/Movie> -> <Arts/Movie/Title>, and if a <Topic/> has <link/>
> children, which means it is a final category, then the <ExternalPage/> nodes
> appear immediately below the <Topic/> in the same order as the <link/> nodes.
>


The problem is that you completely misunderstood the idea of XML::Twig. You
parse as you go. Here is code that gives output somewhat similar to yours.
Don't be surprised by the ->simplify I use to deconstruct twigs - I am just
used to it and it is merely a matter of style. You can just as well use
parent, first_child, att and family. And remember - when working with
XML::Twig, Data::Dumper takes on a whole new meaning :)
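
As a rough illustration of that navigation-method style, here is a minimal sketch that reads a <Topic/> with att, first_child_text and children instead of ->simplify. The sample data, the example.org URLs and the @seen array are made up for the illustration and are not part of the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample, modelled on the data discussed in this thread.
my $xml = '<RDF>
<Topic r:id="Top/Arts">
<catid>2</catid>
<link r:resource="http://example.org/a"/>
<link r:resource="http://example.org/b"/>
</Topic>
</RDF>';

my @seen;    # collected "id catid link" strings, kept only for inspection

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for every <Topic/>, as soon as its closing tag is parsed.
        'Topic' => sub {
            my ($t, $topic) = @_;
            my $id    = $topic->att('r:id');                  # attribute value
            my $catid = $topic->first_child_text('catid');    # text of the first <catid/> child
            for my $link ($topic->children('link')) {         # every <link/> child, in order
                push @seen, join ' ', $id, $catid, $link->att('r:resource');
            }
            $t->purge;    # free everything parsed so far
        },
    },
);

$twig->parse($xml);
print "$_\n" for @seen;
```

The handler receives the same two arguments as the handlers in the script below, so either style can be used inside them.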


#!/usr/bin/perl

use warnings;
use strict;
use XML::Twig;

my $xml = '<RDF>
<Topic r:id="Top">
<catid>1</catid>
</Topic>

<ExternalPage about="">
<topic>Top/</topic>
</ExternalPage>

<Topic r:id="Top/Arts">
<catid>2</catid>
</Topic>

<Topic r:id="Top/Arts/Movies/Titles/1/10_Rillington_Place">
<catid>205108</catid>
<link r:resource="http://www.britishhorrorfilms.co.uk/rillington.shtml"/>
<link
r:resource="http://www.shoestring.org/mmi_revs/10-rillington-place.html"/>
</Topic>

<ExternalPage about="http://www.britishhorrorfilms.co.uk/rillington.shtml">
<d:Title>British Horror Films: 10 Rillington Place</d:Title>
<d:Description>Review which looks at plot especially the shocking features
of it.</d:Description>
<topic>Top/Arts/Movies/Titles/1/10_Rillington_Place</topic>
</ExternalPage>

<ExternalPage
about="http://www.shoestring.org/mmi_revs/10-rillington-place.html">
<d:Title>MMI Movie Review: 10 Rillington Place</d:Title>
<d:Description>Review includes plot, real life story behind the film and
realism in the film.</d:Description>
<topic>Top/Arts/Movies/Titles/1/10_Rillington_Place</topic>
</ExternalPage>
</RDF>';

my %want_links;

my $parser = XML::Twig->new (
    twig_handlers => { 'Topic'        => \&_topic_handler,
                       'ExternalPage' => \&_links_handler },
);


$parser->parse($xml);   #parse XML data

exit 0;


sub _topic_handler {

    my ($twig, $child) = @_;
    my $topic = $child->simplify (forcearray => 1);

    if ($topic->{link}) {
        # generate hash 'link_name' => 'directory'
        %want_links = map { $_->{'r:resource'}, $topic->{'r:id'} } @{$topic->{link}};
    }
    else {
        # reset the hash since we are working on a new topic (no more external links)
        %want_links = ();
    }


    $twig->purge;
}

sub _links_handler {

    my ($twig, $child) = @_;
    my $ext_page = $child->simplify (forcearray => 1);

    if ($want_links{$ext_page->{about}}) {
        #chdir $want_links{$ext_page->{about}} #commented out since I don't have that dir
        print join ("\n", $want_links{$ext_page->{about}},
                          $ext_page->{'d:Title'}[0],
                          $ext_page->{'d:Description'}[0],
        );
        print "\n\n";
    }


    $twig->purge;
}
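
For the file-per-topic variant asked about in question 3) at the top, one possible approach is sketched below. It is only a sketch under assumptions: it relies on the <ExternalPage/> nodes following their <Topic/> directly (as described in the quoted text), so the arrival of the next <Topic/> is taken as the signal to close the previous file. The temporary output directory, the file name links.txt, the helper close_current, the @written array and the sample data are all made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical sample data, modelled on the structure discussed in this thread.
my $xml = '<RDF>
<Topic r:id="Top/Arts">
<catid>2</catid>
</Topic>
<Topic r:id="Top/Arts/Movies">
<catid>3</catid>
<link r:resource="http://example.org/a"/>
</Topic>
<ExternalPage about="http://example.org/a">
<d:Title>Example A</d:Title>
<topic>Top/Arts/Movies</topic>
</ExternalPage>
</RDF>';

my $base = tempdir(CLEANUP => 1);   # hypothetical output root
my %want_links;                     # link URL => topic id, for the current topic only
my $out;                            # file handle of the current topic, if any
my @written;                        # paths written, kept only for inspection

# Close the file of the topic that has just ended, if one is open.
sub close_current {
    return unless $out;
    close $out or die "close failed: $!";
    undef $out;
}

my $twig = XML::Twig->new(
    twig_handlers => {
        'Topic' => sub {
            my ($t, $topic) = @_;
            close_current();        # a new <Topic/> means the previous one is finished
            my $id = $topic->att('r:id');
            %want_links = map { ($_->att('r:resource') => $id) } $topic->children('link');
            if (%want_links) {
                my $dir = File::Spec->catdir($base, split m{/}, $id);
                make_path($dir);
                my $file = File::Spec->catfile($dir, 'links.txt');   # hypothetical name
                open $out, '>', $file or die "cannot open $file: $!";
                push @written, $file;
            }
            $t->purge;
        },
        'ExternalPage' => sub {
            my ($t, $page) = @_;
            my $about = $page->att('about') // '';
            if ($out && exists $want_links{$about}) {
                print {$out} join("\n", $about, $page->first_child_text('d:Title')), "\n\n";
            }
            $t->purge;
        },
    },
);

$twig->parse($xml);     # for a large file on disk, parsefile would be used instead
close_current();        # close the file that belongs to the last topic
```

No explicit loop is needed because the parser itself walks the document and calls each handler whenever a matching element has been completely read; the two handlers are linked only through the shared %want_links hash and the $out file handle.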


-- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] <http://learn.perl.org/> <http://learn.perl.org/first-response>




