On Feb 18, 3:20 am, William Heymann <[EMAIL PROTECTED]> wrote:
> How do I decode a string back to useful unicode that has xml numeric character
> references in it?
>
> Things like &#21344;
BeautifulSoup can handle two of the three formats for html entities.
For instance, an 'o' with umlaut can be represented in three
different ways:

&ouml;
&#246;
&#xf6;

BeautifulSoup can convert the first two formats to unicode:

from BeautifulSoup import BeautifulStoneSoup as BSS

my_string = '&#21344;'
soup = BSS(my_string, convertEntities=BSS.XML_ENTITIES)
print soup.contents[0].encode('utf-8')
print soup.contents[0]

--output:---
<some asian looking character>
Traceback (most recent call last):
  File "test1.py", line 6, in ?
    print soup.contents[0]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5360' in position 0: ordinal not in range(128)

The second print fails only because the unicode string gets implicitly
encoded as ascii on output; the conversion itself worked. The error
message shows you the unicode string that BeautifulSoup produced:

u'\u5360'

If that won't work for you, it's not hard to write your own conversion
function to handle all three formats:

1) Create a regex that will match any of the formats
2) Convert the first format using htmlentitydefs.name2codepoint and
   then unichr()
3) Convert the second format using int(match) and then unichr()
4) Convert the third format using int('0' + match, 16) and then
   unichr(), where match is the 'xf6' part of the reference
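The four steps above can be sketched roughly as follows. This is only a
minimal sketch, not tested against real-world pages: the names ENTITY_RE,
_replace, decode_entities and to_char are made up for illustration, and the
try/except import is an assumption added so the same code also runs on
Python 3, where htmlentitydefs became html.entities and unichr() became
chr(). References that cannot be converted are left as they are.

```python
import re

try:
    # Python 2, as used in the post
    from htmlentitydefs import name2codepoint
    to_char = unichr
except ImportError:
    # Python 3 equivalents (assumption: not part of the original post)
    from html.entities import name2codepoint
    to_char = chr

# Step 1: one regex for all three formats: &#xf6; &#246; &ouml;
ENTITY_RE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);')


def _replace(match):
    """Return the character for one matched reference, or the text unchanged."""
    body = match.group(1)
    try:
        if body[:2].lower() == '#x':
            # Step 4: hex format, e.g. 'xf6' -> int('0xf6', 16)
            return to_char(int('0' + body[1:], 16))
        if body.startswith('#'):
            # Step 3: decimal format, e.g. '246'
            return to_char(int(body[1:]))
        # Step 2: named format, e.g. 'ouml'
        return to_char(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        # Unknown name or code point out of range: leave the text alone
        return match.group(0)


def decode_entities(text):
    """Replace every character reference in a unicode string."""
    return ENTITY_RE.sub(_replace, text)
```

With that, decode_entities(u'&#21344;') == u'\u5360' should be True. On
Python 2, pass in a unicode string, and encode the result (for example with
.encode('utf-8')) before printing it to avoid the UnicodeEncodeError shown
above.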