On 20.12.2010 20:34, spaceman-spiff wrote:
Hi c.l.p folks,
This is a rather long post, but I wanted to include all the details & everything I
have tried so far myself, so please bear with me & read the entire boringly long
post.
I am trying to parse a ginormous (~1 GB) XML file.
0. I am a Python & XML n00b, & have been relying on the excellent beginner book
DIP (Dive Into Python 3 by MP (Mark Pilgrim)... Mark, if you are reading this, you are AWESOME &
so is your witty & humorous writing style)
1. Almost all examples of parsing XML in Python that I have seen start off with
these 4 lines of code:
import xml.etree.ElementTree as etree
tree = etree.parse('*path_to_ginormous_xml*')
root = tree.getroot()  # my huge xml has 1 root at the top level
print(root)
2. In the 2nd line of code above, as Mark explains in DIP, the parse function
builds & returns a tree object in memory (RAM) which represents the entire
document.
I tried this code, which works fine for a small file (~1 MB), but when I run this
simple 4-line .py code in a terminal against my HUGE target file (1 GB), nothing
happens.
In a separate terminal, I run the top command, & I can see a python process
with memory (the VIRT column) increasing from 100 MB all the way up to 2100 MB.
I am guessing that as this happens (over the course of 20-30 mins), the tree
representing the document is being slowly built in memory, but even after 30-40 mins,
nothing happens.
I don't get an error, seg fault or out-of-memory exception.
My hardware setup: I have a Win7 Pro box with 8 GB of RAM & an Intel Core 2 Quad
CPU Q9400.
On this I am running Sun VirtualBox (3.2.12), with Ubuntu 10.10 as the guest OS, with
23 GB disk space & 2 GB (2048 MB) RAM assigned to the guest Ubuntu OS.
3. I also tried using lxml, but an lxml tree is much more expensive, as it
retains more info about a node's context, including references to its parent
[http://www.ibm.com/developerworks/xml/library/x-hiperfparse/].
When I ran the same 4-line code above, but with lxml's ElementTree (using the
import below in line 1 of the code above)
import lxml.etree as lxml_etree
I could see the memory consumption of the python process (which is running the code)
shoot up to ~2700 MB, & then python (or the OS?) kills the process as it nears
the total system memory (2 GB).
I ran the code from 1 terminal window (screenshot: http://imgur.com/ozLkB.png)
& ran top from another terminal (http://imgur.com/HAoHA.png).
4. I then investigated some streaming approaches, but am confused: there is
SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML] & the iterparse
interface [http://effbot.org/zone/element-iterparse.htm].
Which one is the best for my situation?
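From the effbot article, my understanding is that the iterparse version would
look roughly like this (the 'record' tag & the process() function below are
placeholders, since I don't know yet which elements in my file matter):

import xml.etree.ElementTree as etree

# Grab the root element from the first event, so it can be cleared as
# we go; otherwise fully-parsed elements stay attached to the root and
# memory grows almost as badly as with a full parse().
context = iter(etree.iterparse('*path_to_ginormous_xml*',
                               events=('start', 'end')))
event, root = next(context)   # 'start' event for the root element

for event, elem in context:
    if event == 'end' and elem.tag == 'record':   # placeholder tag name
        process(elem)   # hypothetical per-record work
        root.clear()    # free everything parsed so far

Is this the right idiom, & is iterparse preferable to SAX here?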
Any & all code snippets/wisdom/thoughts/ideas/suggestions/feedback/comments of
the c.l.p community would be greatly appreciated.
Please feel free to email me directly too.
thanks a ton
cheers
ashish
email :
ashish.makani
domain:gmail.com
p.s.
Other useful links on XML parsing in Python:
0. http://diveintopython3.org/xml.html
1. http://stackoverflow.com/questions/1513592/python-is-there-an-xml-parser-implemented-as-a-generator
2. http://codespeak.net/lxml/tutorial.html
3. https://groups.google.com/forum/?hl=en&lnk=gst&q=parsing+a+huge+xml#!topic/comp.lang.python/CMgToEnjZBk
4. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
5. http://effbot.org/zone/element-index.htm & http://effbot.org/zone/element-iterparse.htm
6. SAX: http://en.wikipedia.org/wiki/Simple_API_for_XML
Normally (what is normal, anyway?) such files are auto-generated,
and bear an apparent similarity to a database query result,
encapsulated in XML.
Most of the time the structure is the same for every "row" that's in there.
So a very unpythonic, but fast, way would be to let awk extract the
records and write them in CSV format to stdout,
then pipe that to your Python cruncher of choice (a sketch follows below)
and let it do the hard work.
The awk part can be done in Python anyway, so you could skip that step.
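The Python side could be as small as this: a cruncher that streams rows
from stdin, one at a time (the field handling is made up, of course;
adapt it to whatever columns your records have):

import csv
import sys

# Read the CSV rows that awk (or a Python extractor using the
# iterparse idiom quoted above) pipes in on stdin, one record
# at a time, never holding the whole file in memory.
for row in csv.reader(sys.stdin):
    if not row:
        continue
    print(len(row), row[0])   # placeholder: do the real crunching here

Invoked as something like
awk -f extract.awk huge.xml | python cruncher.py
(script names invented, obviously).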
And take a look at xmlsh.org; they offer tools for the command line,
like xml2csv. (Needs Java, btw.)
Cheers