When to clear elements using cElementTree

2012-10-18 Thread Ben Temperton
Hi there, I am parsing some huge XML files (1.8 GB) that look like this:

some data

some data


some data



What I am trying to do is build up a dictionary of lists where the key is the 
parent scan num and the members of the list are the child scan nums.

I have created an iterator:

    for event, elem in cElementTree.iterparse(filename):
        if elem.tag == self.XML_SPACE + "scan":
            parentId = int(elem.get('num'))
            for child in elem.findall(self.XML_SPACE + 'scan'):
                try:
                    indexes = scans[parentId]
                except KeyError:
                    indexes = []
                    scans[parentId] = indexes
                childId = int(child.get('num'))
                indexes.append(childId)
                # choice 1 - child.clear()
            # choice 2 - elem.clear()
        # choice 3 - elem.clear()

If I don't use any of the clear functions, the method works fine, but is very 
slow (presumably because nothing is getting cleared from memory). But if I 
implement any of the clear functions shown, then 

childId = int(child.get('num'))

fails because child.get('num') returns None. If you dump the child 
element using cElementTree.dump(child), all of the attributes on the child 
scans are missing, even though the clear() calls are made after the assignment 
of the childId.

What I don't understand is why the assignment fails when clear() is called 
after it, yet succeeds when clear() is not called at all.
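
One likely explanation, sketched here under stated assumptions: iterparse 
reports only "end" events by default, and a nested element ends before its 
parent does. Calling elem.clear() at a child scan's own end event therefore 
wipes its attributes before the parent's end event gets to read them. The 
sample document and both function names below are made up for illustration, 
and xml.etree.ElementTree stands in for cElementTree.

```python
import io
import xml.etree.ElementTree as ET  # stand-in for cElementTree (assumption)

# Hypothetical sample: one parent scan holding two nested child scans.
SAMPLE = b'<run><scan num="1"><scan num="2"/><scan num="3"/></scan></run>'


def end_event_order(data):
    """Return the 'num' of each scan in the order its end event fires."""
    order = []
    # With no events argument, iterparse yields "end" events only.
    for event, elem in ET.iterparse(io.BytesIO(data)):
        if elem.tag == "scan":
            order.append(elem.get("num"))
    return order


def child_nums_after_clearing(data):
    """Clear every scan at its own end event, then read nums via the parent."""
    seen = []
    for event, elem in ET.iterparse(io.BytesIO(data)):
        if elem.tag == "scan":
            for child in elem.findall("scan"):
                # The child already had its own end event and was cleared
                # there, so its attributes are gone by now.
                seen.append(child.get("num"))
            elem.clear()
    return seen


if __name__ == "__main__":
    print(end_event_order(SAMPLE))
    print(child_nums_after_clearing(SAMPLE))
```

In this sketch the children end before the parent, so the parent finds its 
children still attached but already emptied. That matches the symptom of 
child.get('num') returning None.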

When should I be calling clear() in this case to maximize speed?

Many thanks,

Ben



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: When to clear elements using cElementTree

2012-10-19 Thread Ben Temperton
I managed to solve this using the following method:

    """Returns a dictionary of indexes of spectra for which there are secondary
    scans, along with the indexes of those scans
    """
    scans = dict()

    # get an iterable; "start" events are needed so that the first event
    # really is the root element (with "end" only, it would be the first
    # element to close)
    context = cElementTree.iterparse(self.info['filename'],
                                     events=("start", "end"))

    # turn it into an iterator
    context = iter(context)

    # get the root element
    event, root = next(context)

    for event, elem in context:
        if event == "end" and elem.tag == self.XML_SPACE + "scan":
            parentId = int(elem.get('num'))
            for child in elem.findall(self.XML_SPACE + 'scan'):
                # read the child's num before clearing it
                childId = int(child.get('num'))
                try:
                    indexes = scans[parentId]
                except KeyError:
                    indexes = []
                    scans[parentId] = indexes
                indexes.append(childId)
                child.clear()
            # drop what has already been processed from the tree
            root.clear()
    return scans

I think the trick is using the 'end' event to determine how much data your 
iterparse is taking in, but I'm still not quite clear on whether this is the 
best way to do it.
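
For comparison, here is a sketch of a different approach, not the method 
above: read each scan's num at its "start" event (the attributes are already 
available then) and keep a stack of the scans that are currently open. Every 
scan can then be cleared at its "end" event without losing anything. The 
function name child_scans and the ns parameter are hypothetical, and 
xml.etree.ElementTree stands in for cElementTree.

```python
import io
import xml.etree.ElementTree as ET  # stand-in for cElementTree (assumption)
from collections import defaultdict


def child_scans(source, ns=""):
    """Map each parent scan num to the nums of its direct child scans.

    source is a filename or a binary file object; ns is the namespace
    prefix in "{uri}" form, or "" when the document has no namespace.
    """
    scans = defaultdict(list)
    tag = ns + "scan"
    context = ET.iterparse(source, events=("start", "end"))
    event, root = next(context)  # the first event is the start of the root

    open_scans = []  # nums of the scans we are currently inside
    for event, elem in context:
        if elem.tag != tag:
            continue
        if event == "start":
            num = int(elem.get("num"))
            if open_scans:
                # The innermost open scan is this scan's parent.
                scans[open_scans[-1]].append(num)
            open_scans.append(num)
        else:  # "end"
            open_scans.pop()
            elem.clear()  # safe: num was read at the start event
            if not open_scans:
                # A top-level scan is finished; drop it from the tree.
                root.clear()
    return dict(scans)


if __name__ == "__main__":
    doc = b'<run><scan num="1"><scan num="2"/></scan><scan num="3"/></run>'
    print(child_scans(io.BytesIO(doc)))
```

Scans without children never appear as keys, so the result only lists 
spectra that have secondary scans.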


finding the byte offset of an element in an XML file (tell() and seek()?)

2012-06-14 Thread Ben Temperton
Hi there,

I am working with mass spectroscopy data in the mzXML format that looks like 
this:


  ...
  ...
  ...
  ...
 .


160409990
160442725
160474927
160497386




Where the offset element contains the byte offset of the scan element that 
shares the id. I am trying to write a Python script to remove scan elements and 
their respective offsets, but I can't figure out how to recalculate the byte 
offset for each remaining element once the elements have been removed.

My plan was to write the file out, then read it back in again and search through 
the file for a particular string (e.g. '') and then use the 
tell() method to return the current byte location in the file. However, I'm not 
sure how I would implement this.
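
A minimal sketch of that plan, under assumptions: the rewritten file is read 
back as bytes, and every scan element opens with a literal '<scan' followed by 
a num="N" attribute in the same tag. The pattern SCAN_OPEN and the function 
scan_offsets are hypothetical names. Instead of tell(), it uses the position 
of each regular-expression match, which is already a byte offset when the 
data is bytes.

```python
import re

# Assumed shape of an opening tag: <scan ... num="N" ...>
SCAN_OPEN = re.compile(rb'<scan\s[^>]*?\bnum="(\d+)"')


def scan_offsets(data):
    """Map each scan num to the byte offset of its opening '<scan'."""
    return {int(m.group(1)): m.start() for m in SCAN_OPEN.finditer(data)}


if __name__ == "__main__":
    # Hypothetical stand-in for the file contents after scans were removed.
    data = b'<r>\n<scan num="1"><scan num="2"/></scan>\n<scan num="3"/></r>'
    for num, offset in sorted(scan_offsets(data).items()):
        print(num, offset)
```

For a real file, the data would come from open(path, 'rb') so that positions 
count bytes rather than characters. For files too large to read at once, the 
same search should also work over an mmap of the file. The offsets have to be 
computed on the scan section exactly as it will appear in the final file, 
since rewriting anything before a scan shifts every offset after it.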

Any ideas?

Many thanks,

Ben