On 12/21/2010 3:16 AM, Stefan Behnel wrote:
> Adam Tauno Williams, 20.12.2010 20:49:
>> ...
>> You need to process the document as a stream of elements; aka SAX.
>
> IMHO, this is the worst advice you can give.

Why do you say that? I would have thought that using SAX in this application is an excellent idea.

I agree that for applications where performance is not a problem, and where we need to examine more than one or a few element types, a tree implementation is more functional, less programmer intensive, and provides an easier-to-understand approach to the data. But with huge amounts of data, where performance is a problem, SAX will be far more practical. In the special case where only a few elements are of interest in a complex tree, SAX can sometimes also be more natural and easier to use.
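To make the "only a few elements are of interest" case concrete, here is a minimal sketch using the standard library's xml.sax module. The element names (library, book, title) and the helper names TitleCollector and collect_titles are made up for illustration; nothing here is taken from the O.P.'s actual data.

```python
import io
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means we are not inside a <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


def collect_titles(source):
    """Parse a file name or file object and return the collected titles."""
    handler = TitleCollector()
    xml.sax.parse(source, handler)
    return handler.titles


# Hypothetical sample document, small enough to read at a glance.
sample = io.BytesIO(
    b"<library>"
    b"<book><title>A</title><author>X</author></book>"
    b"<book><title>B</title></book>"
    b"</library>"
)
print(collect_titles(sample))
```

Memory use depends on what the handler chooses to keep, not on the size of the document, which is the property that matters for a 1 GB input.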

SAX might also be more natural for this application. The O.P. could tell us for sure, but I wonder if perhaps his 1 GB XML file is NOT a true single record. You can store an entire text encyclopedia in less than one GB. What he may have is a large number of logically distinct individual records of some kind, each stored as a node in an all-encompassing element wrapper. Building a tree for each record could make sense but, if I'm right about the nature of the data, building a tree for the wrapper gives very little return for the high cost.

If that's so, then I'd recommend one of two approaches:

1. Use SAX, or

2. Parse out individual logical records using string manipulation on an input stream, then build a tree for one individual record in memory using one of the DOM or ElementTree implementations. After each record is processed, discard its tree and start on the next record.
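A rough sketch of the second approach might look like the following. It rests on several assumptions that may not hold for the real file: the records are not nested, the opening tag has no attributes, the tag text never shows up inside comments or CDATA, and each record parses on its own (no namespace prefixes or entities declared on the wrapper). The tag name "record" and the function name iter_records are invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(stream, tag="record", chunk_size=64 * 1024):
    """Yield one parsed element per <tag>...</tag> found in a text stream.

    Assumes flat, attribute-free record tags (see the caveats above).
    """
    open_tag = "<%s>" % tag
    close_tag = "</%s>" % tag
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        while True:
            start = buffer.find(open_tag)
            if start == -1:
                # Keep a short tail in case an opening tag is split
                # across two chunks; the rest is wrapper text we can drop.
                buffer = buffer[-(len(open_tag) - 1):]
                break
            end = buffer.find(close_tag, start)
            if end == -1:
                # Record is incomplete: drop what precedes it, read more.
                buffer = buffer[start:]
                break
            end += len(close_tag)
            # Build a small tree for this one record only.
            yield ET.fromstring(buffer[start:end])
            buffer = buffer[end:]


# Hypothetical sample input; a real run would pass an open text file.
sample = io.StringIO(
    "<records>"
    "<record><id>1</id></record>"
    "<record><id>2</id></record>"
    "</records>"
)
for record in iter_records(sample):
    print(record.findtext("id"))
```

Each tree becomes garbage as soon as the loop moves on to the next record, so only one record's worth of tree is alive at a time. If the string-matching assumptions are too fragile for the real data, ElementTree's iterparse, with each record element cleared after use, gives a similar record-at-a-time pattern with a real parser doing the splitting.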

    Alan
--
http://mail.python.org/mailman/listinfo/python-list
