Re: decode Numeric Character References to unicode

Duncan Booth Mon, 18 Feb 2008 03:21:10 -0800

William Heymann <[EMAIL PROTECTED]> wrote:

> How do I decode a string back to useful unicode that has xml numeric
> character references in it?
> 
> Things like &#21344;
> 
Try something like this:


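import re
from htmlentitydefs import name2codepoint

# Work on a copy so the shared table is not modified, and add 'apos',
# which XML defines but htmlentitydefs does not include.
name2codepoint = name2codepoint.copy()
name2codepoint['apos'] = ord("'")

# Group 1: decimal reference, group 2: hex reference, group 3: entity name.
# Entity names may contain digits (e.g. frac12, sup2).
EntityPattern = re.compile(
    r'&(?:#(\d+)|#[xX]([\da-fA-F]+)|([a-zA-Z][a-zA-Z\d]*));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        decimal, hexadecimal, name = match.groups()
        try:
            if decimal:
                return unichr(int(decimal, 10))
            if hexadecimal:
                return unichr(int(hexadecimal, 16))
        except (ValueError, OverflowError):
            # Code point outside the range unichr() accepts: keep the text.
            return match.group(0)
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        # Unknown entity name: leave the reference as it was.
        return match.group(0)

    # s is a byte string; decode it first so the result is unicode.
    return EntityPattern.sub(unescape, s.decode(encoding))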

Obviously if you really do only want numeric references you can take out 
the lines using name2codepoint and simplify the regex.
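
For what it's worth, here is a rough sketch of that numeric-only version. Note the assumptions: it is written for Python 3 (where chr replaces unichr and the input is already a str, so there is no decode step), and the names NUMERIC_REF and decode_numeric_refs are made up for illustration, not taken from the code above.

```python
import re

# Matches decimal (&#21344;) and hexadecimal (&#x5360;) references only.
# Group 1 holds decimal digits, group 2 holds hex digits.
NUMERIC_REF = re.compile(r'&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));')


def decode_numeric_refs(text):
    """Replace numeric character references in a str with their characters."""
    def unescape(match):
        decimal, hexadecimal = match.groups()
        try:
            if decimal:
                return chr(int(decimal, 10))
            return chr(int(hexadecimal, 16))
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference untouched.
            return match.group(0)

    return NUMERIC_REF.sub(unescape, text)


# Named entities such as &amp; are deliberately left alone.
print(decode_numeric_refs('Things like &#21344; and &#x5360; but not &amp;'))
```

If named entities are wanted as well on Python 3, the standard library's html.unescape() covers both cases, so a hand-written regex is probably only worth it when you need the numeric-only behaviour.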