> Your codes look great and it works perfectly with only some minor problems
> which might due to the XML file itself (I think). However, compared your
> codes with mine, there are something I'd like to ask you if you don't mind.

Not that much :)

> 1) what's the main difference on memory load bewteen setting handlers and
> without setting handlers before calling $parser->parsefile($xml)?
>
> Does it mean that yours actually access the XML file partially, the first
> handler only treats for <Topic/> and the last handler is only for considers
> <ExternalPage/>. If so, does setting handlers actually change the way of
> loading a file?

Take a look at the XML snippet you sent me as sample data. You have a
regular text file with variable-length fields (in other words, the
different keywords/flags/operators/tags etc. have unpredictable sizes in
bytes). So the only way to read such a file is to go byte by byte and
analyze everything as we go. Your code was doing this byte-by-byte
reading all the way to EOF while keeping everything it had read, which
took a corresponding amount of memory.

Now let's break your example apart (I am deliberately omitting lots of
data and adding some):

    <RDF>
      <Topic>
        <1st topic related data>
      </Topic>
      <ExternalPage>
        <1st external page related to 1st topic>
      </ExternalPage>
      <ExternalPage>
        <2nd external page still related to 1st topic>
      </ExternalPage>
      <Topic>
        <2nd topic related data>
      </Topic>
      <ExternalPage>
        <1st external page related to 2nd topic>
      </ExternalPage>
      <Topic>
        <3rd topic>
      </Topic>
      <SomeOtherTag>
        <Some Other Data>
      </SomeOtherTag>
      <Topic>
        <4th topic>
      </Topic>
      <ExternalPage>
        <1st external page related to 4th topic>
      </ExternalPage>
      <ExternalPage>
        <2nd external page still related to 4th topic>
      </ExternalPage>
    </RDF>

Then the following parser declaration:

    my $parser = XML::Twig->new (
        twig_handlers => {
            'Topic'        => \&_topic_handler,
            'ExternalPage' => \&_links_handler,
        },
    );
    $parser->parse($xml);

simply means: Start walking through the XML data (variable, file, url)
and keep going until you see a completed tag (an opening tag followed by
an arbitrary amount of data and then the closing tag). If the tag we just
found matches the twig handler <Topic>...</Topic> - call subroutine
_topic_handler and pass as arguments the twig and this particular tag
object, together with all its "children", in other words all subtags
between <Topic> and </Topic>. By the end of _topic_handler I have taken
all I need from what was passed in, so I can safely throw it away and
reclaim the memory - I execute a ->purge. The same goes for
_links_handler.

> 2) My understanding about your codes is, first you looked at <Topic/> nodes
> and found if they have <link/> child/children, if they have, you saved them
> into a hash table for later <ExternalPage/> comparisions. But my question
> is, how are you going to search all <Topic/> and all <ExternalPage/>one by
> one by just call the subroutine once without using any kinds of loop? and
> how can you link these 2 handlers together?

It is the $parser->parse($xml) line that creates the loop - it will keep
going as long as there is data in the XML, just like while (<>) will keep
going as long as there is input from STDIN. Every time we see a tag set
defined in twig_handlers we call the corresponding subroutine and do
whatever we have to do. Keep in mind that if the parser encounters
something that is not described in the handlers (just like SomeOtherTag
between the 3rd and 4th Topic), no handler is called for it and the next
->purge discards it, so it does not pile up in memory. This is why you
can use XML::Twig to process huge files out of which you need only
several scattered tags.

> 3) My original intention is for each <Topic/> with valid <link/>
> child/children, to open a file in a directory named exactly the same as
> what is found in a Topic->att('about') then write all links information
> found in <ExternalPage/> then close the file. However, after reading at
> your code times and times, I don't know where should I close the file
> handler because sub _links_handler is used for finding out links one by one
> and I don't know when a <ExternalPage/> is finished from parsing.

This is exactly why I said - if <Topic> is not ALWAYS followed by its OWN
<ExternalPage> links - you are screwed. If this is the case you will
never know whether to expect yet another <ExternalPage> tag, somewhere
after 1GB of data, that refers to a <Topic> at the very beginning of the
file. Thus I assume in my code that when we see another <Topic> we are
done with the previous one. This is why I was completely reassigning
%want_links: I do not expect any more info pertaining to the previous
<Topic>. This is where you should close your files as well - keep a
global variable $last_filename and close that file at the beginning of
each new <Topic>.

> Is there any suggestion about this?

In your case you could go a slightly different way, which will work for
any mixture of <Topic> and <ExternalPage>, even if the pages come BEFORE
the topic tag itself:

=== each _topic_handler should:

* See what links are present in the tag and collect them in some
  accessible manner. You could stuff them into a hash keyed by the link
  itself, with Topic->att('about') as the value, or, if the number of
  topics does not permit it (the hash outgrows memory), you can use a
  DBM or something similar.
* See if any of the newly collected links are dangling links and create
  files for them (see below). Delete those links from the dangling
  hash/DB.
* Purge the twig we were working on.

=== each _links_handler should:

* See if the links it contains are already listed in the hash/database
  described above. If they are, create the files as needed and delete
  the links from the hash/DB.
* If there are links but no matching <Topic> reference has been seen
  yet, stuff the data into a dangling links hash/database so it can wait
  until the right <Topic> is found.
* Purge the twig we were working on.

If you write everything correctly you will end up with the files you
need, a hash/DB containing all links for which info was missing, and a
hash/DB containing all info for which a topic was missing, which should
be all you will ever want from a script like that.

However, first examine your file: if you can determine that ALL <Topic>s
are followed by THEIR OWN <ExternalPage>s, you can safely use what I
wrote as a base.

Peter
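
P.S. A rough sketch of the two-hash bookkeeping described above, in case
it helps. It is plain Perl with no XML::Twig in it, so you can try the
logic on its own. Everything named here is made up for illustration:
see_topic and see_page stand in for what _topic_handler and
_links_handler would do after pulling their data out of the twig, emit
stands in for writing to a file, and the URLs and topic names are
invented.

```perl
use strict;
use warnings;

# Hypothetical stand-ins; a real script would fill these from the twig
# inside each handler and write real files instead of filling %output.
my %topic_for;   # link => topic 'about' value, still waiting for its page
my %dangling;    # link => page data, still waiting for its topic
my %output;      # topic => [ page data, ... ], stands in for the files

sub emit {
    my ($topic, $page) = @_;
    push @{ $output{$topic} }, $page;
}

# What _topic_handler would do with the data taken from a <Topic>.
sub see_topic {
    my ($about, @links) = @_;
    for my $link (@links) {
        if (exists $dangling{$link}) {
            # The page arrived before its topic: write it out now.
            emit($about, delete $dangling{$link});
        }
        else {
            $topic_for{$link} = $about;
        }
    }
}

# What _links_handler would do with the data taken from an <ExternalPage>.
sub see_page {
    my ($link, $page) = @_;
    if (exists $topic_for{$link}) {
        emit(delete $topic_for{$link}, $page);
    }
    else {
        $dangling{$link} = $page;
    }
}

# Pages may come before or after the topic that lists them.
see_page('http://a.example/', 'page A');
see_topic('Top/Arts', 'http://a.example/', 'http://b.example/');
see_page('http://b.example/', 'page B');
see_topic('Top/Games', 'http://c.example/');
see_page('http://d.example/', 'page D');
```

When the run is over, whatever is left in %topic_for is a link whose
page never showed up, and whatever is left in %dangling is a page whose
topic never showed up. In the real handlers you would still call ->purge
at the end of each one.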