I'll first try it using SAX, because I want to have as few dependencies as possible. I already have BioPython as a dependency, and I personally don't like having to install lots of packages for a program to work, so I don't want to impose that on other people. But thanks anyway; I might go for cElementTree later on if ordinary SAX proves too slow...
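
For reference, a minimal sketch of the record-at-a-time SAX approach could look like the following. It uses only the standard library's xml.sax and is written for Python 3. The names GeneHandler, parse_genes, and on_gene are made up for illustration, and the code assumes the gene id lives in a Gene-track_geneid element, as in Kent's example quoted in this message. The inline sample is a cut-down stand-in, not real Entrezgene data.

```python
import io
import xml.sax


class GeneHandler(xml.sax.ContentHandler):
    """Build one small record per <Entrezgene> element while streaming."""

    def __init__(self, on_gene):
        super().__init__()
        self.on_gene = on_gene  # hypothetical callback, called once per record
        self.text = []          # character chunks of the element being read
        self.gene = None        # record in progress, or None outside a gene

    def startElement(self, name, attrs):
        self.text = []
        if name == "Entrezgene":
            self.gene = {}

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        self.text.append(content)

    def endElement(self, name):
        if self.gene is not None and name == "Gene-track_geneid":
            self.gene["geneid"] = "".join(self.text).strip()
        elif name == "Entrezgene":
            # The record is complete: hand it over and forget it, so memory
            # use does not grow with the size of the file.
            self.on_gene(self.gene)
            self.gene = None
        self.text = []


def parse_genes(source, on_gene):
    """Stream `source` (a filename or binary file object) through the handler."""
    xml.sax.parse(source, GeneHandler(on_gene))


# Cut-down sample input (assumed structure, for demonstration only).
SAMPLE = b"""<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info><Gene-track>
      <Gene-track_geneid>1</Gene-track_geneid>
    </Gene-track></Entrezgene_track-info>
  </Entrezgene>
  <Entrezgene>
    <Entrezgene_track-info><Gene-track>
      <Gene-track_geneid>2</Gene-track_geneid>
    </Gene-track></Entrezgene_track-info>
  </Entrezgene>
</Entrezgene-Set>"""

genes = []
parse_genes(io.BytesIO(SAMPLE), genes.append)
```

With the real file, the callback would write each gene out (or print it) instead of appending to a list, so only one record is held in memory at a time.
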

On Wed, 20 Apr 2005 08:03:00 -0400, Kent Johnson wrote:

> Willem Ligtenberg wrote:
>>>Willem Ligtenberg <[EMAIL PROTECTED]> wrote:
>>>
>>>>I want to parse a very large (2.4 gig) XML file (bioinformatics
>>>>of course :)) But I have no clue how to do that. Most things I see read
>>>>the entire XML file at once. That isn't going to work here, of course.
>>>>
>>>>So I would like to parse an XML file one record at a time and then be
>>>>able to store the information in another object. How should I do
>>>>that?
>>
>> The XML file I need to parse contains information about genes.
>> So the first element is a gene, and then there are a lot of sub-elements
>> with sub-elements. I only need some of the information and want to store
>> it in an object called gene. Later on this information will be printed
>> into a file, which in its turn will be fed into some other program.
>> This is an example of the XML:
>> <?xml version="1.0"?>
>> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"
>> "NCBI_Entrezgene.dtd">
>> <Entrezgene-Set>
>>   <Entrezgene>
>>     <snip>
>>   </Entrezgene>
>> </Entrezgene-Set>
>
> This should get you started with cElementTree:
>
> import cElementTree as ElementTree
>
> source = 'Entrezgene.xml'
>
> for event, elem in ElementTree.iterparse(source):
>     if elem.tag == 'Entrezgene':
>         # Process the Entrezgene element
>         geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
>         print 'Gene id', geneid
>
>         # Throw away the element, we're done with it
>         elem.clear()
>
> Kent

-- 
http://mail.python.org/mailman/listinfo/python-list