Don't assume that a 2.4G XML file means you have 2.4G of data. Looking at these verbose tags, plus the fact that the XML is pretty-printed (all those leading spaces - not even tabs! - add up), I'm guessing you only have about 5-10% actual data, and the rest is just XML open/close tags and spaces. (For example, 373 characters used to represent a date/time - this is a sin!)
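If you want to check that guess rather than take my word for it, here is a rough sketch that measures how much of a chunk of XML is element text versus tags and whitespace. The function name `payload_fraction` and the sample snippet are made up for illustration; run it on the first few megabytes of the real file to get a meaningful number. It also counts whitespace inside text values as overhead, so treat the result as a ballpark figure only.

```python
import re

def payload_fraction(xml_text):
    """Fraction of characters that are element text, not tags or whitespace."""
    # Remove every tag, then remove all whitespace; what is left is roughly the data.
    without_tags = re.sub(r"<[^>]*>", "", xml_text)
    payload = re.sub(r"\s+", "", without_tags)
    return len(payload) / len(xml_text) if xml_text else 0.0

# Hypothetical pretty-printed sample, using tag names mentioned in this thread.
sample = (
    "<gene-source>\n"
    "      <object-id_id>12345</object-id_id>\n"
    "</gene-source>\n"
)

print(f"{payload_fraction(sample):.1%} of the sample is actual data")
```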
As XML goes, this looks dead easy to parse by non-XML means. It looks like all of your leaf nodes open and close on the same line, which makes them easy to extract with regexps. Especially since you mention "I only need some of the information", you don't even have to build a full document tree representation. A SAX parser would also be a good fit, since you could trigger only on the subset of tags that you are really interested in. Lastly, you could even try a pyparsing approach. I usually don't recommend pyparsing for XML, since there are already many good XML-targeted tools out there, but it is very easy to throw together something in pyparsing that extracts, say, all of the <object-id_id> entries, or all of the <gene-source> structures. What is the subset of information you are looking to extract?
-- Paul
-- http://mail.python.org/mailman/listinfo/python-list
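
P.S. A minimal sketch of the first two options (line-by-line regexp and SAX), assuming the leaf elements really do open and close on one line and carry no attributes. The names `WANTED`, `leaves_by_regex` and `LeafCollector`, and the sample text, are invented for illustration. Note that the regexp version does not decode entities such as `&amp;`, while the SAX version does.

```python
import io
import re
import xml.sax

WANTED = {"object-id_id"}  # tags of interest; adjust to the subset you need

# A leaf element that opens and closes on one line, with no attributes.
LEAF = re.compile(r"^\s*<([\w.-]+)>([^<]*)</\1>\s*$")

def leaves_by_regex(lines, wanted=WANTED):
    """Yield (tag, text) for wanted single-line leaf elements, one line at a time."""
    for line in lines:
        m = LEAF.match(line)
        if m and m.group(1) in wanted:
            yield m.group(1), m.group(2)

class LeafCollector(xml.sax.ContentHandler):
    """SAX handler that keeps text only for the wanted tags."""

    def __init__(self, wanted=WANTED):
        super().__init__()
        self.wanted = wanted
        self.found = []
        self._buf = None  # None while outside a wanted element

    def startElement(self, name, attrs):
        if name in self.wanted:
            self._buf = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate them.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name in self.wanted and self._buf is not None:
            self.found.append((name, "".join(self._buf)))
            self._buf = None

# Hypothetical sample; for the real file, iterate over open(path) instead,
# and pass the path to xml.sax.parse(path, handler).
sample = """<gene-source>
      <object-id_id>12345</object-id_id>
      <other>ignored</other>
</gene-source>
"""

print(list(leaves_by_regex(io.StringIO(sample))))

handler = LeafCollector()
xml.sax.parseString(sample.encode("utf-8"), handler)
print(handler.found)
```

Both versions read the input as a stream, so neither needs to hold the whole 2.4G file or a document tree in memory.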