On Thu, Feb 26, 2004 at 09:47:08AM -0500, Robert Fox wrote:
> Scanning an entire document of this size in order to perform very specific
> event handling for each operation (using SAX) seems like it would be just
> as time consuming as having the entire node tree represented in memory.
> Please correct me if I'm wrong here.
Without an idea of what sort of data massaging/extraction you are doing I
can't really make any comment here. There is no question that moving your
program from a DOM-based approach to a SAX stream-based approach would
require you to totally rethink the way your program works...no small effort
I imagine :) I was able to do some fairly complex things with SAX and SAX
filters in Net::OAI::Harvester, which needed to process potentially large
XML responses, so DOM processing was out of the question.

> On the plus side, I am running this process on a machine that seems to have
> enough RAM to represent the entire document and my code structures (arrays,
> etc.) without the need for virtual memory and heavy disk I/O. However, the
> process is VERY CPU intensive because of all of the sorting and lookups
> that occur for many of the operations. I'm going to see today if I can make
> those more efficient as well.

Do you see a long delay as the DOM is being built? I imagine that this is
where your bottleneck is. A print statement before and after parsing should
show this (there's a quick sketch of what I mean at the end of this message).

> Someone else has suggested to me that perhaps it would be a good idea to
> break up the larger document into smaller parts during processing and only
> work on those parts in a serial mode. It was also suggested that
> XML::LibXML was an efficient tool because of the C library core (libxml2).
> And, I've also now heard of "hybrid" parsers that allow the ease of use and
> flexibility of DOM with the efficiency of SAX (RelaxNGCC).

Breaking up the document into smaller chunks will lower the memory footprint
(smaller DOM), but if you aren't having memory problems the total processing
time will stay about the same...assuming you don't process the chunks in
parallel.

I remember seeing XML::Sablotron being used by Lifetime TV to process XML
data. It's Perl glue for a C library, so you'll get much better performance
than from a pure Perl solution. Definitely give Sablotron a try since your
current tool (XML::XPath) is pure Perl. Same goes for XML::LibXML.

I'd be interested in knowing how it goes, so please update the list with
your findings :)

//Ed
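
P.S. Here is a quick, untested sketch of what I mean by timing the parse,
using XML::LibXML in place of XML::XPath. The file name and XPath
expression are just placeholders for whatever you are actually working with:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Time::HiRes qw( gettimeofday tv_interval );
    use XML::LibXML;

    # time how long building the DOM takes
    my $t0     = [ gettimeofday ];
    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_file( 'records.xml' );  # placeholder file
    print STDERR 'parse: ', tv_interval( $t0 ), " seconds\n";

    # time the XPath lookups separately so you can see which part hurts
    my $t1     = [ gettimeofday ];
    my @titles = $doc->findnodes( '//record/title' );   # placeholder XPath
    print STDERR 'xpath: ', tv_interval( $t1 ), " seconds\n";

    print scalar(@titles), " title nodes found\n";

If the parse line dominates, the DOM build is your bottleneck; if the xpath
line dominates, it's the lookups, and a faster XPath engine (LibXML or
Sablotron) should help the most.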