When to clear elements using cElementTree
Hi there,

I am parsing some huge XML files (1.8 GB) that look like this:

    some data some data some data

What I am trying to do is build up a dictionary of lists where the key is the parent scan num and the members of the list are the child scan nums.

I have created an iterator:

    for event, elem in cElementTree.iterparse(filename):
        if elem.tag == self.XML_SPACE + "scan":
            parentId = int(elem.get('num'))
            for child in elem.findall(self.XML_SPACE + 'scan'):
                try:
                    indexes = scans[parentId]
                except KeyError:
                    indexes = []
                    scans[parentId] = indexes
                childId = int(child.get('num'))
                indexes.append(childId)
                # choice 1 - child.clear()
            # choice 2 - elem.clear()
        # choice 3 - elem.clear()

If I don't use any of the clear functions, the method works fine but is very slow (presumably because nothing is getting cleared from memory). But if I implement any of the clear functions shown, then childId = int(child.get('num')) fails because child.get('num') returns None. If you dump the child element using cElementTree.dump(child), all of the attributes on the child scans are missing, even though the clear() calls are made after the assignment of childId.

What I don't understand is why the assignment fails when clear() is called after it, yet succeeds when clear() is not called at all.

When should I be calling clear() in this case to maximize speed?

Many thanks,

Ben
--
http://mail.python.org/mailman/listinfo/python-list
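A small sketch of why clearing can break the later lookup, assuming the default iterparse behaviour of reporting only "end" events: a nested scan closes before its parent, so it passes through the loop as elem first. If elem.clear() runs at that point, its attributes are already gone by the time the parent finds it with findall(). The miniature document and the name end_order are made up for illustration, and xml.etree.ElementTree stands in for cElementTree, which was removed in Python 3.9.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document: one parent scan holding two child scans.
XML = b'<run><scan num="1"><scan num="2"/><scan num="3"/></scan></run>'

def end_order(clear_each):
    """Record, for every scan end event, the nums it sees on its child scans."""
    seen = []
    # Default events: only "end", so children are reported before their parent.
    for event, elem in ET.iterparse(io.BytesIO(XML)):
        if elem.tag == "scan":
            children = [c.get("num") for c in elem.findall("scan")]
            seen.append((elem.get("num"), children))
            if clear_each:
                # Wipes elem's own attributes too, not just its children.
                elem.clear()
    return seen

without_clear = end_order(False)
with_clear = end_order(True)
```

In the second run the parent still finds two child elements, but each was cleared at its own end event, so get("num") returns None for both.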
Re: When to clear elements using cElementTree
I managed to solve this using the following method:

    """Returns a dictionary of indexes of spectra for which there are
    secondary scans, along with the indexes of those scans
    """
    scans = dict()
    # get an iterable
    context = cElementTree.iterparse(self.info['filename'], events=("end",))
    # turn it into an iterator
    context = iter(context)
    # get the root element
    event, root = context.next()
    for event, elem in context:
        if event == "end" and elem.tag == self.XML_SPACE + "scan":
            parentId = int(elem.get('num'))
            for child in elem.findall(self.XML_SPACE + 'scan'):
                childId = int(child.get('num'))
                try:
                    indexes = scans[parentId]
                except KeyError:
                    indexes = []
                    scans[parentId] = indexes
                indexes.append(childId)
                child.clear()
            root.clear()
    return scans

I think the trick is using the 'end' event to determine how much data your iterparse is taking in, but I'm still not quite clear on whether this is the best way to do it.
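One thing worth checking in the method above: with events=("end",), the first item the iterator returns is the first element to close, not the document root, so root.clear() may not be clearing what the comment says. The sketch below takes a different route and is only one possible arrangement: it asks for both "start" and "end" events, counts how many scan elements are currently open, and clears a scan only once no enclosing scan still needs to read it. The function name child_scans, the ns parameter (standing in for self.XML_SPACE) and the sample document are assumptions for illustration, and xml.etree.ElementTree is used in place of cElementTree.

```python
import io
import xml.etree.ElementTree as ET

def child_scans(source, ns=""):
    """Map each parent scan num to the nums of its direct child scans."""
    scans = {}
    tag = ns + "scan"
    depth = 0  # number of scan elements currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != tag:
            continue
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # At the end event the element is complete, so its children are readable.
        children = [int(c.get("num")) for c in elem.findall(tag)]
        if children:
            scans[int(elem.get("num"))] = children
        if depth == 0:
            # No enclosing scan remains, so nothing will look at this one again.
            elem.clear()
    return scans

# Hypothetical sample: scan 1 has two children, scan 4 has none.
sample = b'<run><scan num="1"><scan num="2"/><scan num="3"/></scan><scan num="4"/></run>'
result = child_scans(io.BytesIO(sample))
```

The cleared top-level scans remain in the tree as empty shells, which is usually small next to the data they held; whether that is acceptable for a 1.8 GB file would need measuring.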
finding the byte offset of an element in an XML file (tell() and seek()?)
Hi there,

I am working with mass spectroscopy data in the mzXML format that looks like this:

    ... ... ... ... . 160409990 160442725 160474927 160497386

Each offset element contains the byte offset of the scan element that shares its id.

I am trying to write a Python script to remove scan elements and their respective offsets, but I can't figure out how to recalculate the byte offset for each remaining element once the elements have been removed.

My plan was to write the file out, then read it back in again, search through the file for a particular string (e.g. ''), and use the tell() method to return the current byte location in the file. However, I'm not sure how I would implement this. Any ideas?

Many thanks,

Ben
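A minimal sketch of the write-then-rescan plan, assuming each scan element opens with the literal text <scan followed by whitespace and carries a num="N" attribute in double quotes (the attribute name is borrowed from the earlier posts, not confirmed for this file). Instead of calling tell() while reading, it searches the raw bytes with a regular expression and takes each match's start position as the byte offset. The names SCAN_START, scan_offsets and scan_offsets_in_file, and the sample data, are made up for illustration.

```python
import mmap
import re

# Assumed shape of a scan opening tag: <scan ... num="N" ...>
SCAN_START = re.compile(rb'<scan\s[^>]*?\bnum="(\d+)"')

def scan_offsets(data):
    """Map scan num -> byte offset of the '<' that opens that scan element."""
    return {int(m.group(1)): m.start() for m in SCAN_START.finditer(data)}

def scan_offsets_in_file(path):
    """Same as scan_offsets, but memory-maps the file instead of reading it all."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return scan_offsets(mm)

# Hypothetical two-scan document, held in memory for the example.
sample = b'<run>\n<scan num="1" peaksCount="2">abc</scan>\n<scan num="2">d</scan>\n</run>'
offsets = scan_offsets(sample)
```

Because the file is opened in binary mode, the positions are true byte offsets regardless of encoding or line endings. If the offset elements sit after all the scans, rewriting their values afterwards should not move the scans, though that layout is an assumption to verify against the real file.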