Willem Ligtenberg wrote:
I want to parse a very large (2.4 gig) XML file (bioinformatics
of course :)), but I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.
So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?
The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"
"NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<snip>
</Entrezgene>
</Entrezgene-Set>
This should get you started with cElementTree:
import cElementTree as ElementTree

source = 'Entrezgene.xml'

for event, elem in ElementTree.iterparse(source):
    if elem.tag == 'Entrezgene':
        # Process the Entrezgene element
        geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
        print 'Gene id', geneid

        # Throw away the element, we're done with it
        elem.clear()
Kent
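
For anyone trying this on a current Python 3, here is a rough sketch of the same idea that also stores each record in an object, as the question asks. It assumes the standard-library xml.etree.ElementTree module, which offers the same iterparse() call. The Gene class, the iter_genes() helper and the tiny in-memory sample are made up for illustration; only the element path comes from the reply above.

```python
import io
import xml.etree.ElementTree as ET


class Gene:
    """Hypothetical container for the fields kept from one record."""

    def __init__(self, geneid):
        self.geneid = geneid


def iter_genes(source):
    """Yield one Gene per <Entrezgene> record, clearing each after use."""
    # By default iterparse reports only "end" events, so each element
    # is complete (all children parsed) by the time we see it.
    for event, elem in ET.iterparse(source):
        if elem.tag == 'Entrezgene':
            geneid = elem.findtext(
                'Entrezgene_track-info/Gene-track/Gene-track_geneid')
            yield Gene(geneid)
            # Drop the record's children and text; we are done with it.
            elem.clear()


# Made-up stand-in for the real file; a filename works in its place.
sample = io.BytesIO(b"""<Entrezgene-Set>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>1</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>2</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
</Entrezgene-Set>""")

genes = list(iter_genes(sample))
```

Writing it as a generator means only one record's worth of data has to be alive at a time, provided the caller handles each Gene as it arrives instead of collecting them all in a list as the sample does. Note that clear() empties each record but leaves the empty element attached to the root, so a file with a very large number of records may still need the root's children pruned now and then.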