From: Nigel Wetters [mailto:[EMAIL PROTECTED]]
> I don't know about how XML::Parser handles memory - last time
> I tried to use it to parse content.rdf from http://dmoz.org ,
> it soaked up all my memory, then bombed. Sometimes, you need
> to write your own parsing subs :)
A casual reader could take that to imply that XML::Parser has
memory leaks. If you have a test case which demonstrates
XML::Parser leaking memory then please forward it to the
maintainer (Clark Cooper). My experience of XML::Parser is
that it is extremely solid and much faster than regexes for
serious parsing.
Is the file you referred to really big? If so, any parsing
module that builds a tree in memory (e.g. XML::Parser in Tree
style, XML::DOM, XML::Simple) would require memory to the
tune of 'n' times the number of bytes in the original file
(where 'n' is probably a bigger number than you'd guess).
When dealing with large XML files, an event-based parser (e.g.
XML::Parser in native mode, the SAX modules) or a hybrid
event/tree module like XML::Twig is best.
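
To make the event-based idea concrete, here is a minimal sketch.
It is written in Python with only the standard library, purely
for illustration; it is not the XML::Parser or XML::Twig API, and
the element names and the count_elements helper are made up for
the example. Each element is handled as its end tag arrives and
then emptied, so the whole document is never held as a full tree.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count <tag> elements in an XML stream, event by event."""
    count = 0
    # iterparse yields each element when its closing tag is read
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Discard this element's text, attributes and children.
        # (The root still keeps empty shells of its direct children.)
        elem.clear()
    return count


# Tiny stand-in for a large file; a real run would pass a file path
sample = io.BytesIO(
    b"<RDF><Topic><Title>A</Title></Topic>"
    b"<Topic><Title>B</Title></Topic></RDF>"
)
topic_count = count_elements(sample, "Topic")
```

The trade-off is the usual one: you give up random access to the
document in exchange for memory use that no longer grows with the
full content of the file.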
But I can only echo what other people have said about parsing
XML with regexes - just don't do it! There are plenty of things
that might exist in an XML document that are really hard to cope
with using regexes (e.g. encodings, entity definitions, entity
expansion, CDATA sections). If your regexes work with simple
XML, you'll come to rely on them and then they'll break because
you give them a document with a euro symbol. So you fix that and
they break when you get a document with an encoding declaration
you weren't prepared for. And on it goes. I can heartily recommend
using XML parsers for parsing XML.
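
A small illustration of the kind of breakage I mean, sketched in
Python with the standard library (the sample document and the
regex are invented for the example). The regex hands back raw
markup, while the parser resolves the CDATA section and the
entity reference for you.

```python
import re
import xml.etree.ElementTree as ET

doc = (
    "<item>"
    "<title><![CDATA[Fish & Chips]]></title>"
    "<title>Salt &amp; Vinegar</title>"
    "</item>"
)

# Naive regex: captures whatever bytes sit between the tags
regex_titles = re.findall(r"<title>(.*?)</title>", doc)
# regex_titles == ['<![CDATA[Fish & Chips]]>', 'Salt &amp; Vinegar']

# A real parser unwraps CDATA and expands the &amp; entity
parsed_titles = [t.text for t in ET.fromstring(doc).iter("title")]
# parsed_titles == ['Fish & Chips', 'Salt & Vinegar']
```

You can of course keep patching the regex for each new case, but
that is exactly the treadmill described above.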
Notwithstanding that rant (sorry, I'm over it now), as the author
of XML::Simple, I *can't* recommend it for the problem
originally posed. The sample XML used mixed content (elements
containing both text and nested elements) which XML::Simple does
not (and will not) support.
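
For anyone unsure what mixed content looks like in practice, here
is a rough Python sketch. The to_dict helper is hypothetical: it
is a deliberately naive tag-to-values mapping written for this
example, not XML::Simple's actual output. The sample element is
also invented. It shows how text that follows a nested element
has nowhere to go in a simple key/value structure.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring(
    "<para>Use <code>XML::Twig</code> for this, "
    "not <code>regexes</code>.</para>"
)


def to_dict(elem):
    """Hypothetical naive mapping: own leading text plus child texts."""
    result = {"content": elem.text}
    for child in elem:
        result.setdefault(child.tag, []).append(child.text)
    return result


flat = to_dict(para)
# flat == {'content': 'Use ', 'code': ['XML::Twig', 'regexes']}
# The text after each <code> element (" for this, not " and ".")
# and the interleaving order are gone.

# Walking the tree in document order keeps everything
ordered = "".join(para.itertext())
# ordered == 'Use XML::Twig for this, not regexes.'
```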
The two modules I'd be inclined to recommend for the original
problem are XML::XPath and XML::Twig. The former is more
standards-based and the latter is more Perlish. I'd have to see
some of the actual XML and the required output before I went any
further.
Regards
Grant