Andrew Robinson, 23.01.2013 16:22: > Good day :), > > I've been exploring XML parsers in python; particularly: > xml.etree.cElementTree; and I'm trying to figure out how to do it > incrementally, for very large XML files -- although I don't think the > problems are restricted to incremental parsing. > > First problem: > I've come across an issue where etree silently drops text without telling > me; and separate. > > I am under the impression that XHTML is a subset of XML (eg:defined tags), > and that once an HTML file is converted to XHTML, the body of the document > can be handled entirely as XML. > > If I convert a (partial/contrived) html file like: > > <html> > <div> > <p> This is example <b>bold</b> text. > </div> > </html> > > to XHTML, I might do --right or wrong-- (1): > > <html> > <div> > <p /> This is example <b>bold</b> text. > </div> > </html> > > or, alternate difference: (2): "<p> This is example <b>bold</b> text. </p>" > > But, when I parse with etree, in example (1) both "This is an example" and > "text." are dropped; > The missing text is part of the start, or end event tags, in the > incrementally parsed method. > > Likewise: In example (2), only "text" gets dropped.
Nope, you should read the manual on this. Here's a tutorial: http://lxml.de/tutorial.html#elements-contain-text This is using lxml.etree, which is the Python XML library most people use these days. It's ElementTree compatible, so the tutorial also works for ET (unless stated otherwise). > Isn't XML supposed to error out when invalid xml is parsed? It does. > I have an XML file which will grow larger than memory on a target machine, > so here's what I want to do: > > Given a source XML file, and a destination file: > 1) iteratively scan part of the source tree. > 2) Optionally Modify some of scanned tree. > 3) Write partial scan/tree out to the destination file. > 4) Free memory of no-longer needed (partial) source XML. > 5) continue scanning a new section of the source file... eg: goto step 1 > until source file is exhausted. > > But, I don't see a way to write portions of an XML tree, or iteratively > write a tree to disk. > How can this be done? There are several ways to do it. Python has a couple of external libraries available that are made specifically for generating markup incrementally. lxml also gained that feature recently. It's not documented yet, but here are usage examples: https://github.com/lxml/lxml/blob/master/src/lxml/tests/test_incremental_xmlfile.py Stefan -- http://mail.python.org/mailman/listinfo/python-list