On Thu, Jun 3, 2010 at 1:44 PM, bfrederi <brfrederi...@gmail.com> wrote:
> I am using lxml iterparse and running into a very obscure error. When
> I run iterparse on a file, it will occasionally return an element that
> has a element.text == None when the element clearly has text in it.
>
> I copy and pasted the problem xml into a python string, used StringIO
> to create a file-like object out of it, and ran a test using iterparse
> with expected output, and it ran perfectly fine. So it only happens
> when I try to run iterparse on the actual file.
>
> So then I tried opening the file, reading the data, turning that data
> into a file-like object using StringIO, then running iterparse on it,
> and the same problem (element.text == None) occurred.
>
> I even tried this:
> f = codecs.open(abbyy_filename, 'r', encoding='utf-8')
> file_data = f.read()
> file_like_object = StringIO.StringIO(file_data)
> for event, element in iterparse(file_like_object, events=("start",
> "end")):

IIRC, XML parsers operate on bytes directly (since they have to determine the encoding themselves anyway), not on pre-decoded Unicode characters, so I think your manual UTF-8 decoding could be the problem. Have you tried simply:

    f = open(abbyy_filename, 'rb')  # binary mode, so the parser gets raw bytes
    for event, element in iterparse(f, events=("start", "end")):
        pass  # whatever

?

Apologies if you already have, but since you didn't include the original (albeit probably trivial) error-causing code, this relatively simple error couldn't be ruled out.

Cheers,
Chris
--
http://blog.rebertia.com
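
For what it's worth, here is a minimal, self-contained sketch of the bytes-in approach suggested above. The sample document and the helper name collect_text are made up for illustration, and io.BytesIO stands in for the real file. It tries lxml first and falls back to the standard library's iterparse, which accepts the same arguments for this call. The sketch reads element.text only on "end" events, since neither parser guarantees that an element's text has been read yet when its "start" event fires.

```python
import io

try:
    from lxml.etree import iterparse  # the library discussed in this thread
except ImportError:
    # Standard-library fallback; accepts the same (source, events) arguments.
    from xml.etree.ElementTree import iterparse

# Hypothetical sample document, encoded to bytes the way it would sit on disk.
SAMPLE = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u'<page><line>caf\u00e9</line></page>'
).encode("utf-8")


def collect_text(byte_stream):
    """Return (tag, text) pairs, reading text only on 'end' events."""
    results = []
    for event, element in iterparse(byte_stream, events=("start", "end")):
        # On "start" the element's text may not have been parsed yet,
        # so only look at it once the closing tag has been seen.
        if event == "end":
            results.append((element.tag, element.text))
    return results


# The parser receives raw bytes and works out the encoding itself.
pairs = collect_text(io.BytesIO(SAMPLE))
```

With a real file, the equivalent would be to open it in binary mode, e.g. open(abbyy_filename, 'rb'), and pass that file object to collect_text instead of the BytesIO.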