ElementTree, XML and Unicode -- C0 Controls

Sébastien Boisgérault Mon, 11 Dec 2006 07:29:22 -0800

Hi all,

The Unicode code points in the U+0000-U+001F range (the C0
controls) -- except newline, tab and carriage return -- are
not legal XML 1.0 characters.
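
To make the range concrete, here is a minimal sketch of a check for
the forbidden C0 controls. The helper name has_illegal_c0 is made up
for this example and is not part of ElementTree; it only covers the
C0 range discussed here, not the rest of the XML 1.0 character rules.

```python
import re

# C0 controls that XML 1.0 forbids: everything in U+0000-U+001F
# except tab (U+0009), newline (U+000A) and carriage return (U+000D).
_ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def has_illegal_c0(text):
    """Return True if text contains a C0 control that XML 1.0 disallows.

    Hypothetical helper for illustration only.
    """
    return _ILLEGAL_C0.search(text) is not None
```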


Serializing such a string with ElementTree succeeds,
but parsing the result back fails:

>>> from xml.etree.ElementTree import Element, tostring, fromstring
>>> elt = Element("root", char=u"\u0000")
>>> xml = tostring(elt)
>>> xml
'<root char="\x00" />'
>>> fromstring(xml)
   [...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12

Good! But I was expecting a failure *earlier*, in
the "tostring" function -- I basically assumed that
ElementTree would refuse to generate an XML
fragment that is not well-formed.

Could anyone comment on the rationale behind
the current behavior? Is it a performance issue,
the search for invalid Unicode code points being
too expensive?
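
For what it's worth, the kind of early failure I have in mind could
be sketched as a wrapper around tostring. The name checked_tostring
is hypothetical, not an ElementTree function, and the wrapper only
looks at text, tail and attributes, not at tag names. It costs one
regex scan per string in the tree; whether that is too expensive is
exactly the question.

```python
import re
from xml.etree.ElementTree import Element, tostring

# C0 controls that are not legal XML 1.0 characters.
_ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def checked_tostring(root):
    """Serialize root, but raise ValueError on illegal C0 controls.

    Hypothetical wrapper for illustration; ElementTree does not ship it.
    """
    for elem in root.iter():
        # Collect every string attached to this element.
        strings = [elem.text, elem.tail]
        for name, value in elem.attrib.items():
            strings.extend((name, value))
        for s in strings:
            if s and _ILLEGAL_C0.search(s):
                raise ValueError(
                    "illegal XML 1.0 character in %r (element %s)"
                    % (s, elem.tag))
    return tostring(root)
```

With this, Element("root", char=u"\u0000") would be rejected before
any XML is produced, instead of at parse time.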

Cheers,

SB

-- 
http://mail.python.org/mailman/listinfo/python-list
