William Heymann <[EMAIL PROTECTED]> wrote:

> How do I decode a string back to useful unicode that has xml numeric
> character references in it?
>
> Things like 占

Try something like this:

    import re
    from htmlentitydefs import name2codepoint

    # Work on a copy so the module-level table is left untouched.
    name2codepoint = name2codepoint.copy()
    # XML defines &apos;, which the HTML 4 entity table lacks.
    name2codepoint['apos'] = ord("'")

    # Group 1: decimal reference, group 2: hex reference, group 3: entity name.
    # Entity names may contain digits after the first letter (e.g. &sup2;).
    EntityPattern = re.compile(
        r'&(?:#(\d+)|#x([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));')

    def decodeEntities(s, encoding='utf-8'):
        def unescape(match):
            code = match.group(1)
            if code:
                return unichr(int(code, 10))
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            code = match.group(3)
            if code in name2codepoint:
                return unichr(name2codepoint[code])
            # Unknown entity name: leave the text as it was.
            return match.group(0)
        return EntityPattern.sub(unescape, s.decode(encoding))

Obviously, if you really only want numeric references, you can take out
the lines using name2codepoint and simplify the regex.

-- 
http://mail.python.org/mailman/listinfo/python-list
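
For the numeric-only case mentioned above, here is a minimal sketch of what the simplified version might look like. Note the assumptions: it is written for Python 3 (where unichr is spelled chr and strings are already unicode, so the encoding argument is dropped), and the names NUMERIC_REF and decode_numeric_refs are made up for illustration.

```python
import re

# Matches decimal (&#21344;) and hexadecimal (&#x5360;) references only.
# Named entities such as &amp; are deliberately not matched.
NUMERIC_REF = re.compile(r'&#(?:(\d+)|x([\da-fA-F]+));')

def decode_numeric_refs(text):
    """Replace numeric character references in a str with their characters."""
    def unescape(match):
        decimal, hexadecimal = match.groups()
        codepoint = int(decimal, 10) if decimal else int(hexadecimal, 16)
        try:
            return chr(codepoint)
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: leave the text as it was.
            return match.group(0)
    return NUMERIC_REF.sub(unescape, text)

result = decode_numeric_refs('&#21344; &#x5360; &amp;')
# result == '\u5360 \u5360 &amp;'
```

If you are on Python 3.4 or later and want named entities handled as well, the standard library's html.unescape() covers both kinds of reference in one call.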