Re: Errors on processing 2GB XML file by using XML:Simple

Peter Rabbitson Tue, 17 May 2005 05:40:41 -0700

> Your code looks great and it works perfectly with only some minor problems 
> which might be due to the XML file itself (I think). However, comparing your 
> code with mine, there is something I'd like to ask you if you don't mind.


Not that much :)

> 1) what's the main difference on memory load between setting handlers and 
> without setting handlers before calling $parser->parsefile($xml)?
> 
> Does it mean that yours actually accesses the XML file partially, the first 
> handler only deals with <Topic/> and the last handler only considers 
> <ExternalPage/>? If so, does setting handlers actually change the way of 
> loading a file?

Take a look at the XML snippet you sent me as sample data. You have a 
regular text file with variable-length data fields (in other words, different 
keywords/flags/operators/tags etc. have unpredictable length/size in bytes). 
So the only way to read such a file is to go byte by byte and analyze 
everything as we go. What your code was doing was reading byte by byte 
until it hit EOF while keeping everything it had read, which took a 
corresponding amount of memory. Now let's break your example apart (I am 
deliberately omitting lots of data and adding some):

<RDF>
        <Topic>
                <1st topic related data>
        </Topic>

        <ExternalPage>
                <1st external page related to 1st topic>
        </ExternalPage>

        <ExternalPage>
                <2nd external page still related to 1st topic>
        </ExternalPage>

        <Topic>
                <2nd topic related data>
        </Topic>

        <ExternalPage>
                <1st external page related to 2nd topic>
        </ExternalPage>

        <Topic>
                <3rd topic>
        </Topic>

        <SomeOtherTag>
                <Some Other Data>
        </SomeOtherTag>
        
        <Topic>
                <4th topic>
        </Topic>

        <ExternalPage>
                <1st external page related to 4th topic>
        </ExternalPage>

        <ExternalPage>
                <2nd external page still related to 4th topic>
        </ExternalPage>
</RDF>

Then the following parser declaration:

my $parser = XML::Twig->new (   twig_handlers => {  
                                        'Topic' => \&_topic_handler,
                                        'ExternalPage' => \&_links_handler,
                                },
                        );

$parser->parse($xml);  

simply means: 

Start walking through the XML data (variable, file, url) and keep going
until you see a completed tag (opening tag followed by an arbitrary amount of
data and then the closing tag). If the tag we just found matches the twig
handler <Topic>...</Topic> - call subroutine _topic_handler and pass it two
arguments: the "twig" itself and the element that matched, in other words
this particular tag object together with all its "children" (all subtags
between <Topic> and </Topic>). At the end of each _topic_handler call I have
taken all I needed from the passed element, so I can safely throw it away,
thus reclaiming memory - I execute a ->purge. Same goes for _links_handler.
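
To make the "take what you need, then purge" idea concrete, here is a minimal 
sketch of such a handler. The sample data is made up, and the attribute names 
'about' and 'resource' are assumptions for illustration, so adjust them to 
whatever your real file uses:

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample data; 'about' and 'resource' are assumed attribute names.
my $xml = <<'XML';
<RDF>
  <Topic about="Top/Arts">
    <link resource="http://example.com/a"/>
    <link resource="http://example.com/b"/>
  </Topic>
  <Topic about="Top/Science">
    <link resource="http://example.com/c"/>
  </Topic>
</RDF>
XML

my %links_for;    # topic name => array ref of link URLs

sub _topic_handler {
    my ($twig, $topic) = @_;    # the twig itself and the completed <Topic>

    # Copy out plain strings only, so nothing keeps the element alive.
    my @urls = map { $_->att('resource') } $topic->children('link');
    $links_for{ $topic->att('about') } = \@urls;

    # Free everything parsed so far, including this <Topic>.
    $twig->purge;
}

my $parser = XML::Twig->new( twig_handlers => { Topic => \&_topic_handler } );
$parser->parse($xml);

printf "%s: %d link(s)\n", $_, scalar @{ $links_for{$_} }
    for sort keys %links_for;
```

The important detail is that the handler stores only strings. If it kept the 
element object itself around, the purge could not give that memory back.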

> 2) My understanding about your codes is, first you looked at <Topic/> nodes 
> and found if they have <link/> child/children, if they have, you saved them 
> into a hash table for later <ExternalPage/> comparisions. But my question 
> is, how are you going to search all <Topic/> and  all <ExternalPage/>one by 
> one by just call the subroutine once without using any kinds of loop? and 
> how can you link these 2 handlers together?

It is the $parser->parse($xml) line that creates the loop - it will keep 
going as long as there is data in the XML, just like while (<>) will keep 
going as long as there is input from STDIN. Every time we see a tag set 
defined in twig_handlers we will call the corresponding subroutine and do 
whatever we have to do. Keep in mind that if it encounters something else 
that is not described in the handlers (just like SomeOtherTag between the 
3rd and 4th Topic) no handler is called for it, and the next ->purge throws 
it away together with everything else parsed so far, so it never piles up in 
memory. This is why you can use XML::Twig to process huge files out of which 
you need several scattered tags.
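
Here is a small sketch of that implicit loop, using made-up sample data (the 
'about' attribute on both tags is an assumption). There is no explicit loop 
in the script: parse() calls the handlers in document order, and nothing is 
called for SomeOtherTag:

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample data; the 'about' attribute is an assumption.
my $xml = <<'XML';
<RDF>
  <Topic about="t1"/>
  <ExternalPage about="p1"/>
  <SomeOtherTag>not handled</SomeOtherTag>
  <Topic about="t2"/>
  <ExternalPage about="p2"/>
</RDF>
XML

my @seen;    # records every handler call in the order it happened

# One routine shared by both handlers: note the tag, then free the memory.
sub record {
    my ($twig, $elt) = @_;
    push @seen, $elt->tag . ' ' . $elt->att('about');
    $twig->purge;
}

my $parser = XML::Twig->new(
    twig_handlers => {
        Topic        => \&record,
        ExternalPage => \&record,
    },
);

# This single call is the loop: it returns once the input is exhausted.
$parser->parse($xml);

print "$_\n" for @seen;
```

This should print the two Topic and two ExternalPage entries in the order 
they appear in the data, with no entry for SomeOtherTag.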
 
> 3) My original intention is for each <Topic/> with valid <link/> 
> child/children, to open a file in a directory named exactly the same as 
> what is found in a Topic->att('about') then write all links information 
> found in <ExternalPage/> then close the file. However, after reading at 
> your code times and times, I don't know where should I close the file 
> handler because sub _links_handler is used for finding out links one by one 
> and I don't know when a <ExternalPage/> is finished from parsing.

This is exactly why I said - if <Topic> is not ALWAYS followed by its OWN 
<ExternalPage> links - you are screwed. If this is the case you will never 
know if you should expect yet another <ExternalPage> tag somewhere after 1GB 
of data that will refer to a <Topic> in the very beginning of the file. Thus 
I assume in my code that when we see another <Topic> we are done with the 
previous one. This is why I was completely reassigning %want_links, because 
I do not expect any more info pertaining to the previous <Topic>. This is 
where you should close your files as well - keep a global variable (say 
$last_filename) for the file that is currently open and close it at the 
beginning of each new <Topic>.
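
A rough sketch of that close-on-next-Topic idea follows. It is not the code 
from my earlier mail: it keeps the open filehandle (rather than the 
filename) in a global, writes into a temporary directory, skips the 
%want_links filtering, and assumes both tags carry an 'about' attribute. 
All of those are choices made just for the illustration:

```perl
use strict;
use warnings;
use XML::Twig;
use File::Temp qw(tempdir);
use File::Spec;

# Made-up sample data; the 'about' attributes are assumptions.
my $xml = <<'XML';
<RDF>
  <Topic about="Arts"/>
  <ExternalPage about="http://example.com/a"/>
  <ExternalPage about="http://example.com/b"/>
  <Topic about="Science"/>
  <ExternalPage about="http://example.com/c"/>
</RDF>
XML

my $out_dir = tempdir( CLEANUP => 1 );    # hypothetical output directory
my $current_fh;                           # file of the topic being filled

sub close_current {
    return unless $current_fh;
    close $current_fh or die "Cannot close file: $!";
    undef $current_fh;
}

sub _topic_handler {
    my ($twig, $topic) = @_;
    close_current();    # a new <Topic> means the previous one is finished

    # Turn the topic name into something safe to use as a filename.
    ( my $name = $topic->att('about') ) =~ s{[^\w.-]}{_}g;
    my $path = File::Spec->catfile( $out_dir, $name );
    open $current_fh, '>', $path or die "Cannot open $path: $!";
    $twig->purge;
}

sub _links_handler {
    my ($twig, $page) = @_;
    print {$current_fh} $page->att('about'), "\n" if $current_fh;
    $twig->purge;
}

XML::Twig->new(
    twig_handlers => {
        Topic        => \&_topic_handler,
        ExternalPage => \&_links_handler,
    },
)->parse($xml);

close_current();    # the last topic has no following <Topic> to close it
```

Note the extra close after parse() returns: the last topic in the file is 
never followed by another <Topic>, so nothing else would close its file.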
 
> Is there any suggestion about this?

In your case you could go a slightly different way which will work for
any mixture of <Topic> and <ExternalPage> even if the pages come BEFORE the
topic tag itself:

=== each _topic_handler should:

* See what links are present in the tag and collect them in some accessible
manner. You could stuff them into a hash keyed by the link itself, with 
Topic->att('about') as the value, or if the number of topics does not permit 
it (the hash grows out of memory) you can use a DBM or something similar. 

* See if any of the newly collected links are dangling links and create 
files for them (see below). Delete the links from the dangling hash/DB.

* Purge the twig we were working on

=== each _links_handler should:

* See if the links it contains are already listed in the hash/database 
described above. If they are, create the files as needed. Delete the links 
from the hash/DB.

* If there are links but there are no <Topic><about> references yet, stuff 
the data into a dangling links hash/database so they can wait until the 
right <Topic> is found.

* Purge the twig we were working on

If you wrote everything correctly you will end up with the files you need, a 
hash/DB containing all links for which info was missing and a hash/DB 
containing all info for which a topic was missing, which should be all you 
will ever want from a script like that. 
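
The bookkeeping from the two outlines above can be sketched in plain Perl, 
without the parser. All names here (on_topic, on_external_page, emit and the 
three hashes) are made up for the illustration, and "creating a file" is 
replaced by pushing a line into %output. In a real script the two twig 
handlers would pull the values out of the element, call these routines and 
then purge:

```perl
use strict;
use warnings;

my %topic_for_link;    # link URL => topic name, still waiting for its page
my %dangling_page;     # link URL => page info, still waiting for its topic
my %output;            # topic name => lines (stand-in for the real files)

# Stand-in for "create the file": remember one line per resolved link.
sub emit {
    my ($topic, $url, $info) = @_;
    push @{ $output{$topic} }, "$url\t$info";
}

# What _topic_handler would do with the data it extracted.
sub on_topic {
    my ($about, @links) = @_;
    for my $url (@links) {
        if ( exists $dangling_page{$url} ) {
            # The page came first: resolve it now and forget it.
            emit( $about, $url, delete( $dangling_page{$url} ) );
        }
        else {
            $topic_for_link{$url} = $about;
        }
    }
}

# What _links_handler would do with the data it extracted.
sub on_external_page {
    my ($url, $info) = @_;
    if ( exists $topic_for_link{$url} ) {
        emit( delete( $topic_for_link{$url} ), $url, $info );
    }
    else {
        $dangling_page{$url} = $info;    # wait for the right <Topic>
    }
}

# Events in mixed order; page 'b' arrives before its topic.
on_external_page( 'http://example.com/b', 'Page B' );
on_topic( 'Arts', 'http://example.com/a', 'http://example.com/b' );
on_external_page( 'http://example.com/a', 'Page A' );
on_topic( 'Science', 'http://example.com/c' );           # page never arrives
on_external_page( 'http://example.com/z', 'Orphan' );    # topic never arrives

print "resolved for $_: ", scalar @{ $output{$_} }, "\n" for sort keys %output;
print "links without a page: ",  join( ', ', sort keys %topic_for_link ), "\n";
print "pages without a topic: ", join( ', ', sort keys %dangling_page ),  "\n";
```

One limitation of this sketch: a link maps to a single topic, so a link 
listed under two topics would be overwritten by the later one. If the hashes 
get too big for memory, tying them to a DBM file should let the same logic 
run unchanged.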

However first examine your file and if you can determine that ALL <Topic>s 
are followed by THEIR OWN <ExternalPage>s, you can safely use as a base what 
I wrote. 

Peter

