Steven D'Aprano wrote:
> A few issues:
>
> (1) It doesn't seem to be reversible:
>
> >>> '&#169; and many more...'.decode('latin-1')
> u'&#169; and many more...'
>
> What should I do instead?

Unfortunately, as far as I know, nothing in the standard library does that. You'll have to write your own function. Here's one I've used before (partially stolen from code in Python patch #912410, which was written by Aaron Swartz):

    from htmlentitydefs import name2codepoint
    import re

    def _replace_entity(m):
        # m.group(1) is the text between '&' and ';'.
        s = m.group(1)
        if s[0] == u'#':
            # Numeric character reference: decimal, or hex after x/X.
            s = s[1:]
            try:
                if s[0] in u'xX':
                    c = int(s[1:], 16)
                else:
                    c = int(s)
                return unichr(c)
            except ValueError:
                # Not a valid number or code point: leave the text alone.
                return m.group(0)
        else:
            # Named entity: look it up in the HTML entity table.
            try:
                return unichr(name2codepoint[s])
            except (ValueError, KeyError):
                return m.group(0)

    _entity_re = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")

    def unescape(s):
        return _entity_re.sub(_replace_entity, s)

> (2) Are XML entities guaranteed to be the same as HTML entities?

XML defines one entity that doesn't exist in HTML: &apos;. But xmlcharrefreplace only generates numeric character references, and those should be the same between XML and HTML.

> (3) Is there a way to find out at runtime what encoders/decoders/error
> handlers are available, and what they do?

From what I remember, that's not possible: the codec system registers search functions that take a name, rather than the names themselves, so there is no list to enumerate. But all of the standard codecs are documented at <http://python.org/doc/current/lib/standard-encodings.html>, and all of the standard error handlers are documented at <http://python.org/doc/current/lib/codec-base-classes.html>.
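
As a rough illustration of points (2) and (3), here is a minimal sketch, not a definitive implementation. It relies on xmlcharrefreplace emitting only decimal references of the form &#NNN;, so undoing its output alone needs less than the general unescape function. It also shows that, although the registered names cannot be listed, a single codec or error-handler name can be probed with codecs.lookup and codecs.lookup_error. The helper names undo_xmlcharrefreplace and is_known are made up for this example, and the unichr/chr fallback is an assumption added so that the code runs on both Python 2 and Python 3.

```python
import codecs
import re

# Decimal numeric character references, the form xmlcharrefreplace emits.
_decimal_ref_re = re.compile(r"&#([0-9]+);")

try:
    _to_char = unichr  # Python 2, as used in the post
except NameError:
    _to_char = chr     # Python 3 (assumption: the sketch should run there too)


def _decode_ref(m):
    try:
        return _to_char(int(m.group(1)))
    except (ValueError, OverflowError):
        # Number is not a usable code point: leave the text untouched.
        return m.group(0)


def undo_xmlcharrefreplace(text):
    """Hypothetical helper: turn &#NNN; references back into characters."""
    return _decimal_ref_re.sub(_decode_ref, text)


def is_known(name, lookup):
    """Hypothetical helper: True if `lookup` accepts `name`.

    `lookup` is codecs.lookup (for codecs) or codecs.lookup_error
    (for error handlers); both raise LookupError for unknown names.
    """
    try:
        lookup(name)
        return True
    except LookupError:
        return False


original = u"\u00a9 and many more..."
# encode() returns a byte string; decode it back to text before unescaping.
escaped = original.encode("ascii", "xmlcharrefreplace").decode("ascii")
restored = undo_xmlcharrefreplace(escaped)
```

Note that this is only a true inverse if the original text did not already contain something that looks like a numeric reference: a literal "&#169;" in the input would also be converted. Named entities such as &copy; are deliberately left alone here; they need the name2codepoint lookup shown earlier.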