On Thu, Feb 26, 2004 at 09:47:08AM -0500, Robert Fox wrote:
> Scanning an entire document of this size in order to perform very specific
> event handling for each operation (using SAX) seems like it would be just
> as time consuming as having the entire node tree represented in memory.
> Please correct me if I'm wrong here.
Without an idea of what sort of data massaging/extraction you are doing I
can't really make any comment here. There is no question that moving your
program from a DOM-based approach to a SAX stream-based approach would
require you to totally rethink the way your program works...no small effort
I imagine :) I was able to do some fairly complex things with SAX and SAX
filters in Net::OAI::Harvester, which needed to process potentially large
XML responses, so DOM processing was out of the question.

> On the plus side, I am running this process on a machine that seems to have
> enough RAM to represent the entire document and my code structures (arrays,
> etc.) without the need for virtual memory and heavy disk I/O. However, the
> process is VERY CPU intensive because of all of the sorting and lookups
> that occur for many of the operations. I'm going to see today if I can make
> those more efficient as well.

Do you see a long delay as the DOM is being built? I imagine that this is
where your bottleneck is. A print statement before and after parsing should
show this (there's a quick sketch of what I mean at the end of this message).

> Someone else has suggested to me that perhaps it would be a good idea to
> break up the larger document into smaller parts during processing and only
> work on those parts in a serial mode. It was also suggested that
> XML::LibXML was an efficient tool because of the C library core (libxml2).
> And, I've also now heard of "hybrid" parsers that allow the ease of use and
> flexibility of DOM with the efficiency of SAX (RelaxNGCC).

Breaking up the document into smaller chunks will lower the memory footprint
(smaller DOM), but if you aren't having memory problems the total processing
time will stay about the same...assuming you don't process the chunks in
parallel.

I remember seeing XML::Sablotron being used by Lifetime TV to process XML
data. It's Perl glue for a C library, so you'll get much better performance
than from a pure Perl solution. Definitely give Sablotron a try since your
current tool (XML::XPath) is pure Perl. Same goes for XML::LibXML.

I'd be interested in knowing how it goes, so please update the list with
your findings :)

//Ed
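
P.S. Here is a quick, untested sketch of what I mean by timing the parse,
using XML::LibXML in place of XML::XPath. The file name and XPath
expression are just placeholders for whatever you are actually working with:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Time::HiRes qw( gettimeofday tv_interval );
    use XML::LibXML;

    # time how long building the DOM takes
    my $t0     = [ gettimeofday ];
    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_file( 'records.xml' );  # placeholder file
    print STDERR 'parse: ', tv_interval( $t0 ), " seconds\n";

    # time the XPath lookups separately so you can see which part hurts
    my $t1     = [ gettimeofday ];
    my @titles = $doc->findnodes( '//record/title' );   # placeholder XPath
    print STDERR 'xpath: ', tv_interval( $t1 ), " seconds\n";

    print scalar(@titles), " title nodes found\n";

If the parse line dominates, the DOM build is your bottleneck; if the xpath
line dominates, it's the lookups, and a faster XPath engine (LibXML or
Sablotron) should help the most.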