Denis S. Otkidach <denis.otkid...@gmail.com> added the comment:

Here is a regexp I use to clean up text. Note that I don't touch the "compatibility characters" that are also not recommended in XML; some other developers remove them too:
import re
import sys

# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
#          [#x10000-#x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and FFFF)
_char_tail = ''
# Only wide (UCS-4) builds can express the supplementary-plane range.
if sys.maxunicode > 0x10000:
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
_nontext_sub = re.compile(
    ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' % _char_tail,
    re.U).sub

def replace_nontext(text, replacement=u'\uFFFD'):
    return _nontext_sub(replacement, text)

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue5166>
_______________________________________
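
For readers on Python 3, where `unichr` and `ur''` literals no longer exist and `sys.maxunicode` is always 0x10FFFF, the same character class can probably be written without the build check. This is only a sketch of an equivalent and was not part of the original comment; it reuses the name `replace_nontext` for comparison.

```python
import re

# XML 1.0 Char production (http://www.w3.org/TR/REC-xml/#NT-Char):
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything outside these ranges is treated as non-text.
_nontext_sub = re.compile(
    '[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
).sub


def replace_nontext(text, replacement='\uFFFD'):
    """Replace every character that XML 1.0 does not allow."""
    return _nontext_sub(replacement, text)
```

Control characters such as NUL, the noncharacters U+FFFE and U+FFFF, and lone surrogates are replaced, while tab, newline, carriage return and supplementary-plane characters pass through unchanged.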