Thanks for the replies.
Your suggestions are very good. Here is my problem, though: I don't think I can process this document in a serial fashion, which is essentially what SAX requires. I need to do a lot of node hopping, jumping from one part of the node tree to another, in order to build somewhat complex data structures for import into the database. Thus, it seems as though I need a DOM parser for this. Re-scanning an entire document of this size to drive very specific event handling for each operation (using SAX) seems like it would be just as time-consuming as having the entire node tree represented in memory. Please correct me if I'm wrong here.
On the plus side, I am running this process on a machine that seems to have enough RAM to hold the entire document and my code structures (arrays, etc.) without resorting to virtual memory and heavy disk I/O. However, the process is VERY CPU intensive because of all of the sorting and lookups that occur for many of the operations. I'm going to see today if I can make those more efficient as well.
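One thing I'm going to try is building a hash index over the records up front, so that each "hop" becomes a constant-time hash lookup instead of another XPath search over the whole tree. A rough sketch of what I mean, using XML::XPath (the file name, element name, and rdf:about attribute are placeholders for my real structure):

use strict;
use warnings;
use XML::XPath;

my $xp = XML::XPath->new( filename => 'resources.rdf' );
$xp->set_namespace( rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#' );

# Build the index once: URI => node.
my %node_for;
for my $rec ( $xp->findnodes('//rdf:Description') ) {
    my $uri = $xp->findvalue( './@rdf:about', $rec )->value;
    $node_for{$uri} = $rec if $uri;
}

# Later, resolving a relationship is a hash lookup instead of
# re-scanning the tree with another //... query:
# my $related = $node_for{$some_uri};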
Someone else has suggested that it might be a good idea to break the larger document up into smaller parts during processing and work on those parts serially (a rough sketch of that idea follows the list of tools below). It was also suggested that XML::LibXML is an efficient tool because of its C library core (libxml2). And I've now also heard of "hybrid" parsers that offer the ease of use and flexibility of DOM with the efficiency of SAX (RelaxNGCC).
For those of you that haven't heard of these tools before, you might want to check out:
XML::Sablotron (similar to XML::LibXML)
XMLPull (http://www.xmlpull.org)
Piccolo Parser (http://piccolo.sourceforge.net)
RelaxNGCC (http://relaxngcc.sourceforge.net/en/index.htm)
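To make the "break it into smaller parts" idea concrete, here is roughly what I have in mind using the pull-style reader that ships with XML::LibXML: stream through the file, but expand one record at a time into a small DOM fragment that can be walked like a tree. This is only a sketch under my assumptions (the file name, record element, and process_record helper are placeholders), and relationships that span records would still need some kind of index:

use strict;
use warnings;
use XML::LibXML::Reader;

my $RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

my $reader = XML::LibXML::Reader->new( location => 'resources.rdf' )
    or die "cannot open resources.rdf\n";

# Stream to each record element, then deep-copy just that record
# into its own small DOM subtree for node hopping.
while ( $reader->nextElement( 'Description', $RDF_NS ) ) {
    my $record = $reader->copyCurrentNode(1);    # 1 = deep copy
    process_record($record);                     # DOM work on a tiny tree
}

sub process_record {
    my ($record) = @_;
    # ... walk $record with DOM calls and load the database ...
}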
I get the impression that if I tried to use SAX parsing for a relatively complex RDF document, the programming load would be rather significant. But, if it speeds up processing by several orders of magnitude, then it would be worth it. I'm concerned, though, that I won't have the ability to crawl the document nodes using conditionals and revert to previous portions of the document that need further processing. What is your experience in this regard?
Thanks again for the responses. This is great.
Rob
At 11:07 AM 2/26/2004 +0000, Peter Corrigan wrote:
On 25 February 2004 20:31, Robert Fox wrote...
> 1. Am I using the best XML processing module that I can for this sort of task?
If it must be faster, then it might be worth porting what you have to XML::LibXML, which has all-round impressive benchmarks, especially for DOM work. Useful comparisons may be found at: http://xmlbench.sourceforge.net/results/benchmark/index.html
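The port should be fairly mechanical, since XML::LibXML speaks XPath as well. Something along these lines (an untested sketch; the file name, element name and namespace binding are only guesses at your structure):

use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_file('resources.rdf');

# An XPathContext lets you bind the rdf: prefix explicitly.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#' );

for my $node ( $xpc->findnodes('//rdf:Description') ) {
    my $about = $node->getAttributeNS(
        'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'about' );
    # ... same node hopping as before, but backed by libxml2 ...
}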
Remember that the internal representation used to manipulate the XML data as a DOM could be up to 5 times the original size, i.e. around 270MB in your case. Simply adding RAM or porting your existing code to another machine might be enough to give you the speed-up you require.
> 3. What is the most efficient way to process through such a large document no matter what XML processor one uses?
SAX-type processing will be faster and use less memory. If you need random access to any point of the tree after the document has been read, you will need DOM, and hence lots of memory.
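For comparison, a SAX pass looks roughly like this: you are handed a stream of events and keep only the state you choose to keep, so memory stays flat no matter how large the file is. Again only a sketch, with a placeholder element name:

package RecordHandler;
use strict;
use warnings;
use base qw(XML::SAX::Base);

sub start_element {
    my ( $self, $el ) = @_;
    # Only the state you choose to keep lives in memory.
    $self->{count}++ if $el->{LocalName} eq 'Description';
}

package main;
use strict;
use warnings;
use XML::SAX::ParserFactory;

my $handler = RecordHandler->new;
my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse_uri('resources.rdf');
print "Saw $handler->{count} records\n";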
If this is a one-off load, I guess you have to balance the cost of your time recoding against the cost of waiting for the data to load using what you already have. Machines usually work cheaper :-)
Best of luck
Peter Corrigan Head of Library Systems James Hardiman Library NUI Galway IRELAND Tel: +353-91-524411 Ext 2497 Mobile: +353-87-2798505
-----Original Message----- From: Robert Fox [mailto:[EMAIL PROTECTED] Sent: 25 February 2004 20:31 To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: XML Parsing for large XML documents
I'm cross posting this question to perl4lib and xml4lib, hoping that someone will have a suggestion.
I've created a very large (~54MB) XML document in RDF format for the purpose of importing related records into a database. Not only does the RDF document contain many thousands of individual records for electronic resources (web resources), but it also contains all of the "relationships" between those resources encoded in such a way that the document itself represents a rather large database of these resources. The relationships
are multi-tiered. I've also written a Perl script which parses this large document and processes all of the XML data in order to import the data, along with all of the various relationships, into the database. The Perl script uses XML::XPath and XML::XPath::XMLParser. I use these modules to find the appropriate document nodes as needed while the processing is going on and the database is being populated. The database is not a flat file: several data tables and linking tables are involved.
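To give a sense of the pattern (much simplified, with placeholder element and attribute names), the script does something like this for each record and each of its relationships:

use strict;
use warnings;
use XML::XPath;

my $xp = XML::XPath->new( filename => 'resources.rdf' );
$xp->set_namespace( rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#' );

# For each resource, pull its fields, then chase its relationships
# with further XPath searches before writing the database rows.
for my $resource ( $xp->findnodes('//rdf:Description') ) {
    my $uri = $xp->findvalue( './@rdf:about', $resource )->value;

    for my $ref ( $xp->findnodes( './/@rdf:resource', $resource ) ) {
        # Each related record is found with another search of the tree.
        my ($related) = $xp->findnodes(
            sprintf( '//rdf:Description[@rdf:about="%s"]', $ref->getValue )
        );
        # ... insert rows into the data tables and linking tables ...
    }
}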
I've run into a problem, though: my Perl script runs very slowly. I've done just about everything I can to optimize the script so that it is memory-efficient and fast, and nothing seems to have significantly helped.
Therefore, I have a couple of questions for the list(s):
1. Am I using the best XML processing module that I can for this sort of task?
2. Has anyone else processed documents of this size, and what have they used?
3. What is the most efficient way to process through such a large document no matter what XML processor one uses?
The processing on this is so amazingly slow that it is likely to take many hours if not days(!) to process through the bulk of records in this XML document. There must be a better way.
Any suggestions or help would be much appreciated,
Rob Fox
Robert Fox Sr. Programmer/Analyst University Libraries of Notre Dame (574)631-3353 [EMAIL PROTECTED]