Re: decode Numeric Character References to unicode

2008-02-18 Thread Ben Finney
7stud <[EMAIL PROTECTED]> writes:

> For instance, an 'o' with umlaut can be represented in three
> different ways:
>
> '&' followed by 'ouml;'
> '&' followed by '#246;'
> '&' followed by '#xf6;'

The fourth way, of course, is to simply have 'ö' appear directly as a character in the document, and

Re: decode Numeric Character References to unicode

2008-02-18 Thread Duncan Booth
7stud <[EMAIL PROTECTED]> wrote:

> On Feb 18, 4:53 am, 7stud <[EMAIL PROTECTED]> wrote:
>> On Feb 18, 3:20 am, William Heymann <[EMAIL PROTECTED]> wrote:
>>
>> > How do I decode a string back to useful unicode that has xml
>> > numeric character references in it?
>>
>> > Things like 占 #w

Re: decode Numeric Character References to unicode

2008-02-18 Thread 7stud
On Feb 18, 4:53 am, 7stud <[EMAIL PROTECTED]> wrote:
> On Feb 18, 3:20 am, William Heymann <[EMAIL PROTECTED]> wrote:
>
> > How do I decode a string back to useful unicode that has xml numeric
> > character references in it?
>
> > Things like 占 #which is: &_#21344_; (without the underscores)
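
For readers on Python 3 (which postdates this thread), the reference quoted above can probably be decoded with the standard library alone. This is a minimal sketch, not anything posted in the thread; the sample string is made up for illustration.

```python
import html

# Hypothetical sample text containing the decimal reference from the question.
encoded = "Things like &#21344;"

# html.unescape() converts named, decimal and hexadecimal character
# references into the corresponding Unicode characters.
decoded = html.unescape(encoded)
print(decoded)
```

The same call should also accept the hexadecimal spelling of that code point, `&#x5360;`.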

Re: decode Numeric Character References to unicode

2008-02-18 Thread 7stud
On Feb 18, 3:20 am, William Heymann <[EMAIL PROTECTED]> wrote:
> How do I decode a string back to useful unicode that has xml numeric character
> references in it?
>
> Things like 占

BeautifulSoup can handle two of the three formats for html entities. For instance, an 'o' with umlaut can be represe

Re: decode Numeric Character References to unicode

2008-02-18 Thread Duncan Booth
William Heymann <[EMAIL PROTECTED]> wrote:

> How do I decode a string back to useful unicode that has xml numeric
> character references in it?
>
> Things like 占
>

Try something like this:

import re
from htmlentitydefs import name2codepoint
name2codepoint = name2codepoint.copy()
name2codepoint
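
The snippet above is cut off in this listing, so the following is not Duncan Booth's code. It is a rough sketch of the same general idea (a regular expression plus `name2codepoint`) written for Python 3, where `htmlentitydefs` became `html.entities`. The names `_REF`, `_replace` and `decode_references` are hypothetical, chosen only for this illustration.

```python
import re
from html.entities import name2codepoint

# Matches the three forms mentioned in the thread:
# &name;, &#decimal; and &#xhex;
_REF = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def _replace(match):
    """Return the character for one matched reference (hypothetical helper)."""
    body = match.group(1)
    try:
        if body[:2].lower() == "#x":
            return chr(int(body[2:], 16))    # hexadecimal reference
        if body.startswith("#"):
            return chr(int(body[1:]))        # decimal reference
        return chr(name2codepoint[body])     # named entity
    except (KeyError, ValueError, OverflowError):
        # Unknown name or out-of-range code point: leave the text unchanged.
        return match.group(0)


def decode_references(text):
    """Replace every recognised character reference in text."""
    return _REF.sub(_replace, text)


print(decode_references("&ouml; &#246; &#xf6; &#21344;"))
```

Unknown names are deliberately left untouched rather than raising, so ordinary text containing a stray '&' passes through unchanged.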