Re: Trying to parse a HUGE(1gb) xml file

Alan Meyer Mon, 27 Dec 2010 12:48:14 -0800

On 12/21/2010 3:16 AM, Stefan Behnel wrote:

Adam Tauno Williams, 20.12.2010 20:49:

...

You need to process the document as a stream of elements; aka SAX.


IMHO, this is the worst advice you can give.

Why do you say that? I would have thought that using SAX in this application is an excellent idea.

I agree that for applications for which performance is not a problem, and for which we need to examine more than one or a few element types, a tree implementation is more functional, less programmer intensive, and provides an easier to understand approach to the data. But with huge amounts of data where performance is a problem, SAX will be far more practical. In the special case where only a few elements are of interest in a complex tree, SAX can sometimes also be more natural and easy to use.
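To illustrate the "only a few elements are of interest" case, here is a minimal sketch using the standard library's xml.sax module. The element name "title", the TitleCollector class, and the sample data are all made up for illustration; nothing here is known about the O.P.'s actual file.

```python
import io
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not currently inside a <title>"

    def startElement(self, name, attrs):
        if name == "title":  # hypothetical element of interest
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


def collect_titles(source):
    """Parse a filename or file object and return the collected title strings."""
    handler = TitleCollector()
    xml.sax.parse(source, handler)
    return handler.titles


# Tiny in-memory stand-in for the real file.
sample = (b"<library><book><title>A</title><author>X</author></book>"
          b"<book><title>B</title></book></library>")
titles = collect_titles(io.BytesIO(sample))
```

The parser never holds more than the current event, so memory use depends only on what the handler chooses to keep, not on the size of the file.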

SAX might also be more natural for this application. The O.P. could tell us for sure, but I wonder if perhaps his 1 GB XML file is NOT a true single record. You can store an entire text encyclopedia in less than one GB. What he may have is a large number of logically distinct individual records of some kind, each stored as a node in an all-encompassing element wrapper. Building a tree for each record could make sense but, if I'm right about the nature of the data, building a tree for the wrapper gives very little return for the high cost.


If that's so, then I'd recommend one of two approaches:

1. Use SAX, or

2. Parse out individual logical records using string manipulation on an input stream, then build a tree for one individual record in memory using one of the DOM or ElementTree implementations. After each record is processed, discard its tree and start on the next record.
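A rough sketch of approach 2, using plain string searches plus xml.etree.ElementTree. It rests on several assumptions that may not hold for the real data: the record element is called "record" (a made-up name), records are not nested, the opening tag is written either as "<record>" or as "<record " followed by attributes, the tag text never appears inside comments or CDATA, and records do not depend on namespace or entity declarations made outside them. The iter_records helper is hypothetical, not part of any library.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(stream, tag="record", chunk_size=64 * 1024):
    """Yield one parsed Element per <tag>...</tag> block found in a text stream."""
    open_plain = "<" + tag + ">"   # opening tag without attributes
    open_attrs = "<" + tag + " "   # opening tag followed by attributes
    close = "</" + tag + ">"
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        while True:
            starts = [i for i in (buf.find(open_plain), buf.find(open_attrs))
                      if i != -1]
            if not starts:
                # Keep a short tail in case an opening tag is split across chunks.
                buf = buf[-len(open_attrs):]
                break
            start = min(starts)
            end = buf.find(close, start)
            if end == -1:
                # Record is incomplete; drop what precedes it and read more input.
                buf = buf[start:]
                break
            end += len(close)
            # Build a tree for this one record only; the caller discards it
            # when it moves on to the next one.
            yield ET.fromstring(buf[start:end])
            buf = buf[end:]


# Tiny in-memory stand-in; a real run would pass an open text file instead.
sample = ('<data><record id="1"><name>a</name></record>'
          '<record><name>b</name></record></data>')
names = [rec.findtext("name")
         for rec in iter_records(io.StringIO(sample), chunk_size=7)]
```

Only one record's text and tree are in memory at a time. If the string-search assumptions are too fragile for the data, ElementTree's iterparse() with elem.clear() after each record is a standard-library alternative that achieves a similar effect with a real parser.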


    Alan
--
http://mail.python.org/mailman/listinfo/python-list
