On 12/20/2010 12:33 PM, Adam Tauno Williams wrote:
On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote:
I need to detect them, and then for each one I need to copy all the
content between the element's start and end tags and create a smaller
XML file.

Yep, do that a lot; via iterparse.
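A minimal sketch of that iterparse approach, splitting each matching element into its own small file. The tag name "record" and the file names are made up for illustration:

```python
import xml.etree.ElementTree as ET

def split_records(path, tag="record", prefix="part"):
    """Write each <tag> element of a large XML file to its own small file."""
    count = 0
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            out = "%s_%06d.xml" % (prefix, count)
            # Serialize just this element as a standalone document.
            ET.ElementTree(elem).write(out, encoding="utf-8",
                                       xml_declaration=True)
            elem.clear()   # discard the element's children to bound memory
            count += 1
    return count
```

Because only "end" events are requested, each element is complete when you see it, and clearing it right after writing keeps memory use flat even on multi-gigabyte inputs.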

1. Can you point me to some examples/samples of using SAX,
especially ones dealing with really large XML files?

   I've just subclassed HTMLParser for this.  It's slow, but
100% Python.  Using the SAX parser is essentially equivalent.
I'm processing multi-gigabyte XML files and updating a MySQL
database, so I do need to look at all the entries, but don't
need a parse tree of the XML.
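For reference, a SAX handler that streams the document without ever building a tree looks like this (the <record> tag is hypothetical; real processing would go where the counter is incremented):

```python
import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    """Count hypothetical <record> elements without building a parse tree."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "record":
            self.count += 1

handler = RecordCounter()
# Streaming parse: the whole document is never held in memory at once.
xml.sax.parseString(b"<root><record/><record/></root>", handler)
print(handler.count)   # 2
```

For a file on disk you would call xml.sax.parse(filename, handler) instead; the memory profile stays flat either way.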

SAX is equivalent to iterparse (iterparse is, essentially, a way to do
SAX-like processing).

   Iterparse does try to build a tree, although you can discard the
parts you don't want.  If you can't decide whether a part of the XML
is of interest until you're deep into it, an "iterparse" approach
may result in a big junk tree.  You have to keep clearing the "root"
element to discard that.
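That root-clearing idiom looks like this: ask iterparse for "start" events too, so the first event hands you the root, then prune it as you go. A sketch with made-up tag names:

```python
import io
import xml.etree.ElementTree as ET

sample = io.BytesIO(b"<root><row>a</row><row>b</row></root>")

# Requesting "start" events means the very first event is the root element.
context = ET.iterparse(sample, events=("start", "end"))
_, root = next(context)              # start of <root>
seen = 0
for event, elem in context:
    if event == "end" and elem.tag == "row":
        seen += 1                    # ... real processing would go here ...
        root.clear()                 # drop finished children from the tree
print(seen)
```

Without the root.clear() call, every processed element stays attached to the root and the "junk tree" grows with the file.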

I provided an iterparse example already. See the Read_Rows method in
<http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>

I don't quite see the point of creating a class with only static methods. That's basically a verbose way to create a module.

2. This brings me to another question, which I forgot to ask in my
original post: is simply opening the file and using regular expressions
to look for the element I need a *good* approach?

No.

   If the XML file has a very predictable structure, that may not be
a bad idea.  It's not very general, but if you have some XML file
that's basically fixed format records using XML to delimit the
fields, pounding on the thing with a regular expression is simple
and fast.
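A sketch of that regex approach, with a hypothetical <row> tag. Note how fragile it is: it assumes no nested <row> elements, no CDATA sections, and no comments, which is exactly why it only works on rigidly fixed-format records:

```python
import re

data = "<rows><row id='1'>alpha</row><row id='2'>beta</row></rows>"

# Non-greedy match of each <row>...</row> chunk.  The \b stops
# <row from also matching the <rows> wrapper element.
row_re = re.compile(r"<row\b[^>]*>(.*?)</row>", re.DOTALL)
print(row_re.findall(data))   # ['alpha', 'beta']
```

For anything less predictable than this, a real parser (SAX or iterparse) is the safer choice.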
                                        John Nagle


--
http://mail.python.org/mailman/listinfo/python-list
