On 20.12.2010 20:34, spaceman-spiff wrote:
Hi c.l.p folks,
This is a rather long post, but I wanted to include all the details & everything I
have tried so far myself, so please bear with me & read the entire boringly long
post.
I am trying to parse a ginormous (~1 GB) XML file.
0. I am a Python & XML n00b, & have been relying on the excellent beginner book
DIP (Dive Into Python 3 by MP (Mark Pilgrim)... Mark, if you are reading this, you are AWESOME &
so is your witty & humorous writing style)
1. Almost all examples of parsing XML in Python that I have seen start off with
these 4 lines of code:
import xml.etree.ElementTree as etree
tree = etree.parse('*path_to_ginormous_xml*')
root = tree.getroot()  # my huge xml has 1 root at the top level
print(root)
2. In the 2nd line of code above, as Mark explains in DIP, the parse function
builds & returns a tree object in memory (RAM) which represents the entire
document.
I tried this code, which works fine for a small file (~1 MB), but when I run this
simple 4-line .py code in a terminal against my HUGE target file (1 GB), nothing
happens.
In a separate terminal, I run the top command, & I can see a python process
with memory (the VIRT column) increasing from 100 MB all the way up to 2100 MB.
I am guessing that as this happens (over the course of 20-30 mins), the tree
representing the document is being slowly built in memory, but even after 30-40 mins,
nothing happens.
I don't get an error, seg fault or out-of-memory exception.
My hardware setup: I have a Win7 Pro box with 8 GB of RAM & an Intel Core 2 Quad
CPU Q9400.
On this I am running Sun VirtualBox (3.2.12), with Ubuntu 10.10 as the guest OS, with
23 GB disk space & 2 GB (2048 MB) RAM assigned to the guest Ubuntu OS.
3. I also tried using lxml, but an lxml tree is much more expensive, as it
retains more info about a node's context, including references to its parent
[http://www.ibm.com/developerworks/xml/library/x-hiperfparse/].
When I ran the same 4-line code above, but with lxml's ElementTree (using the
import below in line 1 of the code above)
import lxml.etree as lxml_etree
I could see the memory consumption of the python process (which is running the code)
shoot up to ~2700 MB, & then python (or the OS?) kills the process as it nears
the total system memory (2 GB).
I ran the code from 1 terminal window (screenshot: http://imgur.com/ozLkB.png)
& ran top from another terminal (http://imgur.com/HAoHA.png).
4. I then investigated some streaming approaches, but am confused: there is
SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML] & the iterparse
interface [http://effbot.org/zone/element-iterparse.htm].
Which one is the best for my situation?
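From the effbot article, my understanding is that the iterparse version would
look roughly like this (the 'record' tag & the process() function below are
placeholders, since I don't know yet which elements in my file matter):

import xml.etree.ElementTree as etree

# Grab the root element from the first event, so it can be cleared as
# we go; otherwise fully-parsed elements stay attached to the root and
# memory grows almost as badly as with a full parse().
context = iter(etree.iterparse('*path_to_ginormous_xml*',
                               events=('start', 'end')))
event, root = next(context)   # 'start' event for the root element

for event, elem in context:
    if event == 'end' and elem.tag == 'record':   # placeholder tag name
        process(elem)   # hypothetical per-record work
        root.clear()    # free everything parsed so far

Is this the right idiom, & is iterparse preferable to SAX here?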
Any & all code snippets/wisdom/thoughts/ideas/suggestions/feedback/comments of
the c.l.p community would be greatly appreciated.
Please feel free to email me directly too.
thanks a ton
cheers
ashish
email :
ashish.makani
domain:gmail.com
p.s.
Other useful links on XML parsing in Python:
0. http://diveintopython3.org/xml.html
1. http://stackoverflow.com/questions/1513592/python-is-there-an-xml-parser-implemented-as-a-generator
2. http://codespeak.net/lxml/tutorial.html
3. https://groups.google.com/forum/?hl=en&lnk=gst&q=parsing+a+huge+xml#!topic/comp.lang.python/CMgToEnjZBk
4. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
5. http://effbot.org/zone/element-index.htm & http://effbot.org/zone/element-iterparse.htm
6. SAX: http://en.wikipedia.org/wiki/Simple_API_for_XML
Normally (what is normal, anyway?) such files are auto-generated,
and bear an apparent similarity to a database query result,
encapsulated in XML.
Most of the time the structure is the same for every "row" that's in there.
So a very unpythonic, but fast, way would be to let awk extract the
records and write them in CSV format to stdout,
then pipe that to your Python cruncher of choice (a sketch follows below)
and let it do the hard work.
The awk part can be done in Python anyway, so you could skip that step.
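The Python side could be as small as this: a cruncher that streams rows
from stdin, one at a time (the field handling is made up, of course;
adapt it to whatever columns your records have):

import csv
import sys

# Read the CSV rows that awk (or a Python extractor using the
# iterparse idiom quoted above) pipes in on stdin, one record
# at a time, never holding the whole file in memory.
for row in csv.reader(sys.stdin):
    if not row:
        continue
    print(len(row), row[0])   # placeholder: do the real crunching here

Invoked as something like
awk -f extract.awk huge.xml | python cruncher.py
(script names invented, obviously).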
And take a look at xmlsh.org; they offer tools for the command line,
like xml2csv. (Needs Java, btw.)
Cheers