Trying to parse a HUGE (1GB) XML file

2010-12-20 Thread spaceman-spiff
Hi c.l.p folks

This is a rather long post, but I wanted to include all the details & 
everything I have tried so far, so please bear with me & read the entire 
boringly long post.

I am trying to parse a ginormous (~1 GB) XML file.


0. I am a Python & XML n00b, & have been relying on the excellent beginner 
book Dive Into Python 3 by Mark Pilgrim (Mark, if you are reading this, 
you are AWESOME & so is your witty & humorous writing style)


1. Almost all examples of parsing XML in Python that I have seen start off 
with these 4 lines of code.

import xml.etree.ElementTree as etree
tree = etree.parse('*path_to_ginormous_xml*')
root = tree.getroot()  #my huge xml has 1 root at the top level
print root

2. In the 2nd line of code above, as Mark explains in DIP, the parse function 
builds & returns a tree object, in memory (RAM), which represents the entire 
document.
I tried this code, & it works fine for a small file (~1 MB), but when I run 
the same simple 4-line script in a terminal against my HUGE target file 
(1 GB), nothing happens.
In a separate terminal, I run the top command, & I can see a python process 
with memory (the VIRT column) increasing from 100 MB all the way up to 2100 MB.

I am guessing that, as this happens (over the course of 20-30 mins), the tree 
representing the document is being slowly built in memory, but even after 
30-40 mins, nothing happens.
I don't get an error, seg fault, or out-of-memory exception.

My hardware setup: I have a Win7 Pro box with 8 GB of RAM & an Intel Core 2 
Quad Q9400 CPU.
On this I am running Sun VirtualBox (3.2.12), with Ubuntu 10.10 as the guest 
OS, with 23 GB of disk space & 2 GB (2048 MB) of RAM assigned to the guest.

3. I also tried using lxml, but an lxml tree is much more expensive, as it 
retains more info about a node's context, including references to its parent.
[http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]

When I ran the same 4-line code above, but with lxml's ElementTree (using the 
import below in line 1 of the code above):
import lxml.etree as lxml_etree

I can see the memory consumption of the python process (which is running the 
code) shoot up to ~2700 MB, & then python (or the OS?) kills the process as it 
nears the total system memory (2 GB).

I ran the code from 1 terminal window (screenshot: http://imgur.com/ozLkB.png)
& ran top from another terminal (screenshot: http://imgur.com/HAoHA.png)

4. I then investigated some streaming approaches, but am confused: there is 
SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML] & the iterparse 
interface [http://effbot.org/zone/element-iterparse.htm].

Which one is the best for my situation?
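From what I can piece together from the effbot article, the basic iterparse pattern looks something like this (a minimal sketch; the tiny in-memory document & the 'rec' tag name are made-up stand-ins for the real 1gb file & whatever element it actually contains):

```python
import io
import xml.etree.ElementTree as etree

# tiny in-memory stand-in for the real 1gb file; in practice you would
# pass the file path straight to iterparse
demo = io.BytesIO(b'<root><rec id="1"/><rec id="2"/></root>')

count = 0
# 'end' events fire as each element finishes parsing, so the whole
# tree never has to be held in RAM at once
for event, elem in etree.iterparse(demo, events=('end',)):
    if elem.tag == 'rec':   # the element being hunted for
        count += 1
    elem.clear()            # discard the element's contents to keep memory flat
print(count)  # 2
```

The effbot page also suggests periodically clearing the root's accumulated (now-empty) children; for a flat dump of config records, elem.clear() alone should already keep the footprint roughly constant.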

Any & all code snippets/wisdom/thoughts/ideas/suggestions/feedback/comments of 
the c.l.p community would be greatly appreciated.
Please feel free to email me directly too.

thanks a ton

cheers
ashish

email : 
ashish.makani
domain:gmail.com

p.s.
Other useful links on xml parsing in python
0. http://diveintopython3.org/xml.html
1. 
http://stackoverflow.com/questions/1513592/python-is-there-an-xml-parser-implemented-as-a-generator
2. http://codespeak.net/lxml/tutorial.html
3. 
https://groups.google.com/forum/?hl=en&lnk=gst&q=parsing+a+huge+xml#!topic/comp.lang.python/CMgToEnjZBk
4. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
5. http://effbot.org/zone/element-index.htm & 
http://effbot.org/zone/element-iterparse.htm
6. SAX : http://en.wikipedia.org/wiki/Simple_API_for_XML


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Trying to parse a HUGE (1GB) XML file

2010-12-20 Thread spaceman-spiff
Hi Usernet

First up, thanks for your prompt reply.
I will make sure I read RFC 1855 before posting again, but right now I am 
chasing a hard deadline :)

I am sorry I left out what exactly I am trying to do.

0. Goal: I am looking for a specific element. There are several tens/hundreds 
of occurrences of that element in the 1 GB XML file.
The contents of the XML are just a dump of config parameters from a packet 
switch (although IMHO, the contents of the XML don't matter).

I need to detect them & then, for each one, copy all the content between the 
element's start & end tags & create a smaller XML file.
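In case it helps to see what I mean, here is a rough sketch of the kind of thing I am after (untested on the real file; 'rec' & the 'huge.xml' path in the example call are placeholders for the actual element name & file):

```python
import xml.etree.ElementTree as etree

def split_out(source, tag, prefix):
    """Write each <tag> subtree found in source to its own small XML file."""
    n = 0
    for event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag == tag:
            # wrap the completed subtree in its own tree & serialise it
            etree.ElementTree(elem).write('%s_%d.xml' % (prefix, n))
            n += 1
            elem.clear()  # free the copied subtree right away
    return n

# e.g. split_out('huge.xml', 'rec', 'switch_config')
# -> writes switch_config_0.xml, switch_config_1.xml, ...
```

Note that only the matched elements are cleared here; clearing every element as it ends would empty out the children of a matched element before it gets written.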

1. Can you point me to some examples/samples of using SAX, especially ones 
dealing with really large XML files?

2. This brings me to another question, which I forgot to ask in my OP 
(original post).
Is simply opening the file & using regexes to look for the element I need a 
*good* approach?
While researching my problem, some articles seemed to advise against this, 
especially since it is known a priori that the file is XML, & since regex code 
gets complicated very quickly & is not very readable.

But is that just a "style"/"elegance" issue? For my particular problem 
(detecting a certain element, & then creating (writing) a smaller XML file 
corresponding to each pair of start & end tags of said element), is the 
open-file-&-regex approach something you would recommend?
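For what it's worth, here is the kind of corner case those articles warn about: a toy snippet (the '<switch>' tag & attribute are made up) where a naive "match the start tag" regex is derailed by a '>' inside an attribute value, which is legal XML:

```python
import re

xml = '<switch name="a>b"><port id="1"/></switch>'
# a naive "start tag" regex: '<switch' then anything up to the next '>'
m = re.search(r'<switch[^>]*>', xml)
# the character class stops at the '>' INSIDE the attribute value,
# so the captured "start tag" is truncated mid-attribute
print(m.group(0))  # <switch name="a>
```

A real parser tokenises attribute values properly, so it never trips over this.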

Thanks again for your super-prompt response :)

cheers
ashish