[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

Denis S. Otkidach Tue, 24 Nov 2009 09:27:36 -0800

Denis S. Otkidach <[email protected]> added the comment:

Here is a regexp I use to clean up text (note, that I don't touch 
"compatibility characters" that are also not recommended in XML; some 
other developers remove them too):


# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
#          [#x10000- #x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and 
FFFF)
_char_tail = ''
if sys.maxunicode > 0x10000:
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
_nontext_sub = re.compile(
                ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' % 
_char_tail,
                re.U).sub
def replace_nontext(text, replacement=u'\uFFFD'):
    return _nontext_sub(replacement, text)

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue5166>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

Reply via email to