Thanks for the replies.
Your suggestions are very good. Here is my problem, though: I don't think I can process this document in a serial fashion, which is essentially what SAX requires. I need to do a lot of node hopping, jumping from one part of the node tree to another, in order to build somewhat complex data structures for import into the database. Thus, it seems as though I need a DOM parser for this. Re-scanning an entire document of this size to drive very specific event handling for each operation (using SAX) seems like it would be just as time-consuming as having the entire node tree represented in memory. Please correct me if I'm wrong here.
On the plus side, I am running this process on a machine that seems to have enough RAM to hold the entire document and my code structures (arrays, etc.) without resorting to virtual memory and heavy disk I/O. However, the process is VERY CPU intensive because of all of the sorting and lookups that occur for many of the operations. I'm going to see today if I can make those more efficient as well.
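One thing I'm going to try is building a hash index over the records up front, so that each "hop" becomes a constant-time hash lookup instead of another XPath search over the whole tree. A rough sketch of what I mean, using XML::XPath (the file name, element name, and rdf:about attribute are placeholders for my real structure):

use strict;
use warnings;
use XML::XPath;

my $xp = XML::XPath->new( filename => 'resources.rdf' );
$xp->set_namespace( rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#' );

# Build the index once: URI => node.
my %node_for;
for my $rec ( $xp->findnodes('//rdf:Description') ) {
    my $uri = $xp->findvalue( './@rdf:about', $rec )->value;
    $node_for{$uri} = $rec if $uri;
}

# Later, resolving a relationship is a hash lookup instead of
# re-scanning the tree with another //... query:
# my $related = $node_for{$some_uri};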
Someone else has suggested that it might be a good idea to break the larger document up into smaller parts during processing and work on those parts serially (a rough sketch of that idea follows the list of tools below). It was also suggested that XML::LibXML is an efficient tool because of its C library core (libxml2). And I've now also heard of "hybrid" parsers that offer the ease of use and flexibility of DOM with the efficiency of SAX (RelaxNGCC).
For those of you that haven't heard of these tools before, you might want to check out:
XML::Sablotron (similar to XML::LibXML)
XMLPull (http://www.xmlpull.org)
Piccolo Parser (http://piccolo.sourceforge.net)
RelaxNGCC (http://relaxngcc.sourceforge.net/en/index.htm)
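To make the "break it into smaller parts" idea concrete, here is roughly what I have in mind using the pull-style reader that ships with XML::LibXML: stream through the file, but expand one record at a time into a small DOM fragment that can be walked like a tree. This is only a sketch under my assumptions (the file name, record element, and process_record helper are placeholders), and relationships that span records would still need some kind of index:

use strict;
use warnings;
use XML::LibXML::Reader;

my $RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

my $reader = XML::LibXML::Reader->new( location => 'resources.rdf' )
    or die "cannot open resources.rdf\n";

# Stream to each record element, then deep-copy just that record
# into its own small DOM subtree for node hopping.
while ( $reader->nextElement( 'Description', $RDF_NS ) ) {
    my $record = $reader->copyCurrentNode(1);    # 1 = deep copy
    process_record($record);                     # DOM work on a tiny tree
}

sub process_record {
    my ($record) = @_;
    # ... walk $record with DOM calls and load the database ...
}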
I get the impression that if I tried to use SAX parsing for a relatively complex RDF document, the programming load would be rather significant. But, if it speeds up processing by several orders of magnitude, then it would be worth it. I'm concerned, though, that I won't have the ability to crawl the document nodes using conditionals and revert to previous portions of the document that need further processing. What is your experience in this regard?
Thanks again for the responses. This is great.
Rob
At 11:07 AM 2/26/2004 +0000, Peter Corrigan wrote:
On 25 February 2004 20:31, Robert Fox wrote...
> 1. Am I using the best XML processing module that I can for this sort of task?
If it must be faster, then it might be worth porting what you have to XML::LibXML, which has all-round impressive benchmarks, especially for DOM work. Useful comparisons may be found at: http://xmlbench.sourceforge.net/results/benchmark/index.html
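The port should be fairly mechanical, since XML::LibXML speaks XPath as well. Something along these lines (an untested sketch; the file name, element name and namespace binding are only guesses at your structure):

use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_file('resources.rdf');

# An XPathContext lets you bind the rdf: prefix explicitly.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#' );

for my $node ( $xpc->findnodes('//rdf:Description') ) {
    my $about = $node->getAttributeNS(
        'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'about' );
    # ... same node hopping as before, but backed by libxml2 ...
}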
Remember that the internal representation used to manipulate the XML data as a DOM could be up to 5 times the original size, i.e. around 270MB in your case. Simply adding RAM or porting your existing code to another machine might be enough to give you the speed-up you require.
> 3. What is the most efficient way to process through such a large document no matter what XML processor one uses?
SAX-type processing will be faster and use less memory. If you need random access to any point of the tree after the document has been read, you will need DOM, and hence lots of memory.
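For comparison, a SAX pass looks roughly like this: you are handed a stream of events and keep only the state you choose to keep, so memory stays flat no matter how large the file is. Again only a sketch, with a placeholder element name:

package RecordHandler;
use strict;
use warnings;
use base qw(XML::SAX::Base);

sub start_element {
    my ( $self, $el ) = @_;
    # Only the state you choose to keep lives in memory.
    $self->{count}++ if $el->{LocalName} eq 'Description';
}

package main;
use strict;
use warnings;
use XML::SAX::ParserFactory;

my $handler = RecordHandler->new;
my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse_uri('resources.rdf');
print "Saw $handler->{count} records\n";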
If this is a one-off load, I guess you have to balance the cost of your time recoding against the cost of waiting for the data to load using what you already have. Machines usually work cheaper :-)
Best of luck
Peter Corrigan Head of Library Systems James Hardiman Library NUI Galway IRELAND Tel: +353-91-524411 Ext 2497 Mobile: +353-87-2798505
-----Original Message----- From: Robert Fox [mailto:[EMAIL PROTECTED] Sent: 25 February 2004 20:31 To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: XML Parsing for large XML documents
I'm cross posting this question to perl4lib and xml4lib, hoping that someone will have a suggestion.
I've created a very large (~54MB) XML document in RDF format for the purpose of importing related records into a database. Not only does the RDF document contain many thousands of individual records for electronic resources (web resources), but it also contains all of the "relationships" between those resources encoded in such a way that the document itself represents a rather large database of these resources. The relationships
are multi-tiered. I've also written a Perl script which parses this large document and processes all of the XML data in order to import the data, along with all of the various relationships, into the database. The Perl script uses XML::XPath and XML::XPath::XMLParser. I use these modules to find the appropriate document nodes as needed while the processing is going on and the database is being populated. The database is not a flat file: several data tables and linking tables are involved.
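To give a sense of the pattern (much simplified, with placeholder element and attribute names), the script does something like this for each record and each of its relationships:

use strict;
use warnings;
use XML::XPath;

my $xp = XML::XPath->new( filename => 'resources.rdf' );
$xp->set_namespace( rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#' );

# For each resource, pull its fields, then chase its relationships
# with further XPath searches before writing the database rows.
for my $resource ( $xp->findnodes('//rdf:Description') ) {
    my $uri = $xp->findvalue( './@rdf:about', $resource )->value;

    for my $ref ( $xp->findnodes( './/@rdf:resource', $resource ) ) {
        # Each related record is found with another search of the tree.
        my ($related) = $xp->findnodes(
            sprintf( '//rdf:Description[@rdf:about="%s"]', $ref->getValue )
        );
        # ... insert rows into the data tables and linking tables ...
    }
}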
I've run into a problem, though: my Perl script runs very slowly. I've done just about everything I can to optimize the script so that it is memory-efficient and fast, and nothing seems to have significantly helped.
Therefore, I have a couple of questions for the list(s):
1. Am I using the best XML processing module that I can for this sort of task?
2. Has anyone else processed documents of this size, and what have they used?
3. What is the most efficient way to process through such a large document no matter what XML processor one uses?
The processing on this is so amazingly slow that it is likely to take many hours if not days(!) to process through the bulk of records in this XML document. There must be a better way.
Any suggestions or help would be much appreciated,
Rob Fox
Robert Fox Sr. Programmer/Analyst University Libraries of Notre Dame (574)631-3353 [EMAIL PROTECTED]