Re: decode Numeric Character References to unicode

7stud Mon, 18 Feb 2008 03:57:19 -0800

On Feb 18, 3:20 am, William Heymann <[EMAIL PROTECTED]> wrote:
> How do I decode a string back to useful unicode that has xml numeric character
> references in it?
>
> Things like &#21344;


BeautifulSoup can handle two of the three formats for html entities.
For instance, an 'o' with umlaut can be represented in three different
ways:

&ouml;
&#246;
&#xf6;

BeautifulSoup can convert the first two formats to unicode:

from BeautifulSoup import BeautifulStoneSoup as BSS

my_string = '&#21344;'
soup = BSS(my_string, convertEntities=BSS.XML_ENTITIES)
print soup.contents[0].encode('utf-8')
print soup.contents[0]

--output:---
<some asian looking character>

Traceback (most recent call last):
  File "test1.py", line 6, in ?
    print soup.contents[0]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5360' in
position 0: ordinal not in range(128)

The traceback comes from the second print, which tries to encode the
result as ASCII. The error message also shows you the unicode string
that BeautifulSoup produced: u'\u5360'

If that won't work for you, it's not hard to write your own conversion
function to handle all three formats:

1) Create a regex that will match any of the formats
2) Convert the first format using htmlentitydefs.name2codepoint
3) Convert the second format using unichr()
4) Convert the third format using int('0' + match, 16) and then
unichr()
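
The four steps above could be sketched roughly as follows. This is only
one way to do it, not tested against BeautifulSoup's own behaviour. The
names ENTITY_RE, _replace and decode_entities are made up for this
example, and the try/except imports are an assumption added so the same
code also runs on Python 3, where htmlentitydefs became html.entities
and unichr() became chr(). The regex captures only the hex digits, so
step 4 becomes int(digits, 16) rather than int('0' + match, 16).

```python
import re

try:
    # Python 2, as used elsewhere in this thread
    from htmlentitydefs import name2codepoint
except ImportError:
    # Python 3 moved the same table to html.entities
    from html.entities import name2codepoint

try:
    unichr
except NameError:
    unichr = chr  # Python 3: chr() already returns a unicode character

# Step 1: one regex for all three formats.
#   group 1: hex digits of      &#xf6;
#   group 2: decimal digits of  &#246;
#   group 3: entity name of     &ouml;
ENTITY_RE = re.compile(
    r'&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));')

def _replace(match):
    hex_digits, dec_digits, name = match.groups()
    try:
        if hex_digits is not None:
            # Step 4: hexadecimal numeric reference
            return unichr(int(hex_digits, 16))
        if dec_digits is not None:
            # Step 3: decimal numeric reference
            return unichr(int(dec_digits))
        # Step 2: named entity, looked up in the standard table
        return unichr(name2codepoint[name])
    except (KeyError, ValueError, OverflowError):
        # Unknown name or out-of-range code point: leave the text alone
        return match.group(0)

def decode_entities(text):
    """Return text with entity references replaced by their characters."""
    return ENTITY_RE.sub(_replace, text)
```

With this sketch, decode_entities(u'&#21344;') should give u'\u5360',
the same string BeautifulSoup produced above. Pass it a unicode string
rather than a byte string, and encode the result yourself (for example
with .encode('utf-8')) before printing to an ASCII console.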
-- 
http://mail.python.org/mailman/listinfo/python-list
