Re: XML parsing per record

Kent Johnson Wed, 20 Apr 2005 05:06:36 -0700

Willem Ligtenberg wrote:

Willem Ligtenberg <[EMAIL PROTECTED]> wrote:

I want to parse a very large (2.4 gig) XML file (bioinformatics,
of course :)), but I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?


The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" 
"NCBI_Entrezgene.dtd">
<Entrezgene-Set>
  <Entrezgene>
    <snip>
  </Entrezgene>
</Entrezgene-Set>


This should get you started with cElementTree:

import cElementTree as ElementTree

source = 'Entrezgene.xml'

for event, elem in ElementTree.iterparse(source):
    if elem.tag == 'Entrezgene':
        # Process the Entrezgene element
        geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
        print 'Gene id', geneid

        # Throw away the element, we're done with it
        elem.clear()
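
On a current Python, cElementTree is part of the standard library as xml.etree.ElementTree, so the same idea can be sketched roughly as below. This is only an illustration under some assumptions: the function name iter_gene_ids and the tiny inline sample document are made up for the example, and the element path is simply the one used above.

```python
import io
import xml.etree.ElementTree as ElementTree


def iter_gene_ids(source):
    """Yield the gene id of each Entrezgene record, one record at a time.

    iter_gene_ids is a hypothetical helper name, not part of any library.
    """
    # iterparse reports 'end' events by default, so an element is complete
    # (all of its children have been parsed) by the time it shows up here.
    for event, elem in ElementTree.iterparse(source):
        if elem.tag == 'Entrezgene':
            yield elem.findtext(
                'Entrezgene_track-info/Gene-track/Gene-track_geneid')
            # Throw away the record's contents, we're done with it
            elem.clear()


# Made-up stand-in for the real file, for illustration only
sample = b"""<?xml version="1.0"?>
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>1</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
  </Entrezgene>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>2</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
  </Entrezgene>
</Entrezgene-Set>
"""

gene_ids = list(iter_gene_ids(io.BytesIO(sample)))
for geneid in gene_ids:
    print('Gene id', geneid)
```

For the real file you would pass the filename instead of the BytesIO object. Note that the cleared Entrezgene elements stay attached to the root as empty shells; with a very large number of records it may be worth clearing the root element now and then as well.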

Kent
--
http://mail.python.org/mailman/listinfo/python-list