On 12/21/2010 3:16 AM, Stefan Behnel wrote:
> Adam Tauno Williams, 20.12.2010 20:49:
>> ...
>> You need to process the document as a stream of elements; aka SAX.
> IMHO, this is the worst advice you can give.
Why do you say that? I would have thought that using SAX in this
application is an excellent idea.
I agree that for applications for which performance is not a problem,
and for which we need to examine more than one or a few element types, a
tree implementation is more functional, less programmer intensive, and
provides an easier to understand approach to the data. But with huge
amounts of data, where performance is a problem, SAX will be far more
practical. In the special case where only a few elements are of
interest in a complex tree, SAX can sometimes also be more natural and
easy to use.
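To illustrate the "only a few elements are of interest" case, here is a
minimal sketch using the standard library's xml.sax. The element name
"title", the class TitleCollector, the helper collect_titles, and the
sample data are all made up for illustration; the O.P.'s real element
names are unknown.

```python
import io
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element and ignore the rest.

    "title" is a hypothetical element name chosen for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None  # a list while inside <title>, otherwise None

    def startElement(self, name, attrs):
        if name == "title":
            self._buf = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title" and self._buf is not None:
            self.titles.append("".join(self._buf))
            self._buf = None


def collect_titles(source):
    """Parse a filename or file-like object and return the collected titles."""
    handler = TitleCollector()
    xml.sax.parse(source, handler)
    return handler.titles


# Made-up sample document standing in for the large file.
sample = (b"<library><book><title>Dune</title><pages>412</pages></book>"
          b"<book><title>Emma</title></book></library>")
titles = collect_titles(io.BytesIO(sample))
print(titles)
```

Only the current title's text is held in memory at any time; the rest of
the document streams past without a tree being built.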
SAX might also be more natural for this application. The O.P. could
tell us for sure, but I wonder if perhaps his 1 GB XML file is NOT a
true single record. You can store an entire text encyclopedia in less
than one GB. What he may have is a large number of logically distinct
individual records of some kind, each stored as a node in an
all-encompassing element wrapper. Building a tree for each record could
make sense but, if I'm right about the nature of the data, building a
tree for the wrapper gives very little return for the high cost.
If that's so, then I'd recommend one of two approaches:
1. Use SAX, or
2. Parse out individual logical records using string manipulation on an
input stream, then build a tree for one individual record in memory
using one of the DOM or ElementTree implementations. After each record
is processed, discard its tree and start on the next record.
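A rough sketch of approach 2, under some loud assumptions: records are
wrapped in a hypothetical <record> element, records are never nested,
the opening tag carries no attributes, and the tag text never appears
inside comments or CDATA. The function name iter_records and the sample
data are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(stream, tag="record", chunk_size=64 * 1024):
    """Yield one small tree per <tag>...</tag> block found in a text stream.

    Assumes un-nested records, an attribute-free opening tag, and no
    occurrences of the tag text inside comments or CDATA sections.
    """
    open_tag = "<%s>" % tag
    close_tag = "</%s>" % tag
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        while True:
            start = buf.find(open_tag)
            if start == -1:
                # Keep only a tail that could hold a partial opening tag.
                buf = buf[-(len(open_tag) - 1):]
                break
            end = buf.find(close_tag, start)
            if end == -1:
                # Record is incomplete; drop what precedes it and read more.
                buf = buf[start:]
                break
            end += len(close_tag)
            # Build a tree for this one record only.
            yield ET.fromstring(buf[start:end])
            buf = buf[end:]


# Made-up sample; the tiny chunk size forces tags to span chunk boundaries.
sample = "<dump><record><id>1</id></record>\n<record><id>2</id></record></dump>"
ids = [rec.findtext("id") for rec in iter_records(io.StringIO(sample), chunk_size=7)]
print(ids)
```

Each yielded element becomes garbage as soon as the caller moves on to
the next one, so memory use is bounded by the largest single record
rather than by the whole file.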
Alan