Willem Ligtenberg wrote:
Willem Ligtenberg <[EMAIL PROTECTED]> wrote:

I want to parse a very large (2.4 gig) XML file (bioinformatics,
of course :)), but I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?

The XML file I need to parse contains information about genes. The
first element is a gene, and it has a lot of sub-elements, which have
sub-elements of their own. I only need some of the information and want
to store it in an object called gene. Later on this information will be
printed to a file, which in turn will be fed into some other program.

This is an example of the XML:

<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
  <Entrezgene>
    <snip>
  </Entrezgene>
</Entrezgene-Set>

This should get you started with cElementTree:

import cElementTree as ElementTree

source = 'Entrezgene.xml'

for event, elem in ElementTree.iterparse(source):
    if elem.tag == 'Entrezgene':
        # Process the Entrezgene element
        geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
        print 'Gene id', geneid

        # Throw away the element, we're done with it
        elem.clear()
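
For anyone on Python 3, here is a rough sketch of the same idea using
xml.etree.ElementTree from the standard library, which took over from the
separate cElementTree package. It also covers the "store it in another
object" part of the question. The Gene class, the iter_genes function and
the GENEID_PATH constant are hypothetical names made up for this
illustration, and only the one field from the example above is extracted.
It additionally clears the root element after each record, on the
assumption that the emptied Entrezgene elements would otherwise pile up
under the root of a file this size.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Iterator, Optional

# Path taken from the example above; other fields would need their own paths.
GENEID_PATH = 'Entrezgene_track-info/Gene-track/Gene-track_geneid'


@dataclass
class Gene:
    """Hypothetical record holding only the fields of interest."""
    geneid: Optional[str]


def iter_genes(source) -> Iterator[Gene]:
    """Yield one Gene per <Entrezgene> element, parsing incrementally."""
    context = ET.iterparse(source, events=('start', 'end'))
    # The first event is the start of the root element; keep a reference to it.
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == 'Entrezgene':
            # The element is complete here, so its sub-elements can be read.
            yield Gene(geneid=elem.findtext(GENEID_PATH))
            # Drop the processed record from the root to keep memory bounded.
            root.clear()
```

Looping over iter_genes('Entrezgene.xml') then gives one Gene at a time,
which can be written to the output file before the next record is read.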

Kent
--
http://mail.python.org/mailman/listinfo/python-list
