Thank you both for the suggestions. I ran a few more experiments to understand how iterparse behaves with respect to three dimensions:
a. Is the encoding declared in the header (if there is one)?
b. Is the text ascii-encodable (i.e. within range(128))?
c. Does the passed file object's read() method return str or unicode (e.g. codecs.open(f, encoding='utf8'))?

Feel free to correct me if I misinterpreted what is really happening.

As John Krukoff mentioned, omitting the encoding declaration is equivalent to declaring encoding="utf-8", whatever the other two dimensions are, so (a) drops out. This leaves (b) and (c).

If a text node is ascii-encodable, iterparse() returns it as a byte string, regardless of the declared encoding and the input file's read() return type.

(c) becomes relevant only if a text node is not ascii-encodable. In this case iterparse() returns unicode if the underlying file's read() returns bytes in an encoding that matches (or at least is compatible with) the declared encoding in the header (or the implied utf8). Passing a file object whose read() returns unicode instead causes the characters to be implicitly encoded to ascii, which raises a UnicodeEncodeError since the text node is not ascii-encodable.

It's interesting that the elements' text attributes after a successful parse do not necessarily have the same type, i.e. they are not all str or all unicode. I ported some text extraction code from BeautifulSoup (which handles all text as unicode) and I was surprised to find out that in xml.etree the type of the returned text is not fixed, even within the same file. Although it's not a bug, having a mixed collection of byte and unicode strings from the same source makes me somewhat uneasy.

George
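
P.S. For anyone who wants to reproduce the (b)/(c) observations, here is a minimal sketch. The sample document and the helper names text_types() and as_unicode() are made up for illustration, and the code assumes Python 2.6+ (for io and the bytes alias) or Python 3. The mixed str/unicode result described above is Python 2 behaviour; on Python 3 every text node comes back as str, so the type column is uniform there.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample: one ascii-only text node and one non-ascii text node,
# stored as utf-8 bytes with the encoding declared in the header.
SAMPLE = (u'<?xml version="1.0" encoding="utf-8"?>'
          u'<root><a>plain</a><b>caf\u00e9</b></root>').encode("utf-8")


def text_types(xml_bytes):
    """Return (tag, type name of elem.text) for every element that has text."""
    source = io.BytesIO(xml_bytes)  # read() returns bytes, not unicode
    return [(elem.tag, type(elem.text).__name__)
            for _event, elem in iterparse(source)
            if elem.text is not None]


def as_unicode(text):
    """Normalize a text node to unicode.

    Only ascii-encodable nodes are reported to come back as byte strings,
    so decoding them as ascii should be lossless.
    """
    if isinstance(text, bytes):
        return text.decode("ascii")
    return text


if __name__ == "__main__":
    for tag, type_name in text_types(SAMPLE):
        print("%s: %s" % (tag, type_name))
```

Running every elem.text through something like as_unicode() is one possible way to get a uniform collection of unicode strings, similar to what BeautifulSoup hands back.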