Stefan Behnel wrote:
bryan rasmussen top-posted:
On Thu, Apr 24, 2008 at 9:55 PM, Stefan Behnel <[EMAIL PROTECTED]> wrote:
from lxml import etree
tree = etree.parse("thefile.xhtml")
tree.write("thefile.html", method="html")
http://codespeak.net/lxml
wow, that's pretty nice there.
Just to know: what's the performance like on XML instances of 1 GB?
That's a pretty big file, although you didn't mention what kind of XML
language you want to handle and what you want to do with it.
lxml is pretty conservative in terms of memory:
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
But the exact numbers depend on your data. lxml holds the XML tree in memory,
which is a lot bigger than the serialised data. So, for example, if you have
2 GB of RAM and want to parse a serialised 1 GB XML file full of tiny elements
that each hold a single integer into an in-memory tree, be prepared for a long
lunch break. With a lot of long text string content instead, it might still fit.
However, lxml also has a couple of step-by-step and stream parsing APIs:
http://codespeak.net/lxml/parsing.html#the-target-parser-interface
http://codespeak.net/lxml/parsing.html#the-feed-parser-interface
http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
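To give a rough idea of the iterparse approach, here is a minimal sketch that sums the integer content of many small elements without building the whole tree. The element name "value", the helper sum_values and the sample data are made up for illustration, and the fallback to the standard library's ElementTree is only there so the snippet runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree is close enough for this sketch.
    import xml.etree.ElementTree as etree


def sum_values(source, tag="value"):
    """Sum the integer text of every <tag> element while streaming the input."""
    total = 0
    # "end" events fire once an element has been parsed completely.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == tag:
            total += int(elem.text)
            # Drop the element's text, attributes and children so that
            # already-processed data does not pile up in memory.
            elem.clear()
    return total


# Hypothetical sample data; a real run would pass a filename or open file.
sample = io.BytesIO(b"<root><value>1</value><value>2</value><value>3</value></root>")
result = sum_values(sample)
print(result)
```

Note that the cleared (empty) elements stay attached to the root element, so memory use still grows slowly with the number of elements; whether that matters depends on your data.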
If you repeatedly work with huge XML files (say, larger than available
RAM), an XML database may also be a good option.
My current favorite in this realm is Sedna (free, Apache 2.0 license).
Among other features, it has facilities for indexing within documents
and collections (faster queries) and transactional sub-document updates
(safely modify parts of a document without rewriting the entire
document). I have been working on a Python interface to it recently
(zif.sedna, on PyPI).
Regarding RAM consumption, a Sedna database uses approximately 100 MB of
RAM by default, and that does not change much, no matter how much (or
how little) data is actually stored.
For a quick idea of Sedna's capabilities, the Sedna folks have put up an
online demo that serves and runs XQuery against an extract from Wikipedia
(in the range of 20 GB of data) using a Sedna server, at
http://wikidb.dyndns.org/ . Along with the on-line demo, they provide
instructions for deploying the technology locally.
- Jim Washington