Re: Special chars with HTMLParser

Piet van Oostrum Wed, 05 Aug 2009 05:31:32 -0700

>>>>> Fafounet <fafou...@gmail.com> (F) wrote:

>F> Hello,
>F> I am parsing a web page with special chars such as &#xE9; (which
>F> stands for é).
>F> I know I can have the unicode character é from unicode
>F> ("\xe9","iso-8859-1")
>F> but with those extra characters I don't know.


>F> I tried to implement handle_charref within HTMLParser without success.
>F> Furthermore, if I have the data ab&#xE9;cd, handle_data will get "ab",
>F> handle_charref will get xe9 and then handle_data doesn't have the end
>F> of the string ("cd").

Character references denote Unicode ordinals (code points), not
iso-8859-1 characters. Your example happens to give the right character
because iso-8859-1 coincides with the first 256 Unicode ordinals, but
for characters outside iso-8859-1 that approach will fail.
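
To make the difference concrete, here is a small sketch. It assumes
Python 3, where unichr() is spelled chr(); the euro sign is just an
example I picked of a character that iso-8859-1 does not contain.

```python
# Python 3 sketch; in Python 2 the equivalent of chr() is unichr().

# &#xE9; : ordinal 0xE9 is below 256, so iso-8859-1 and Unicode agree.
e_acute = chr(0xE9)
same_as_latin1 = (e_acute == b"\xe9".decode("iso-8859-1"))

# &#x20AC; : the euro sign, ordinal 0x20AC, has no iso-8859-1 byte at all.
euro = chr(0x20AC)
try:
    euro.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
```

So treating the number as a Unicode ordinal works in both cases, while
going through iso-8859-1 only works for the first one.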

This should give you an idea:

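from htmlentitydefs import name2codepoint
...
    def handle_charref(self, name):
        # name is the text between '&#' and ';', e.g. 'xE9' or '233'
        if name.startswith(('x', 'X')):
            # hexadecimal reference; the prefix may be 'x' or 'X'
            num = int(name[1:], 16)
        else:
            num = int(name, 10)
        print 'char:', repr(unichr(num))

    def handle_entityref(self, name):
        # name is the entity name without '&' and ';', e.g. 'eacute'
        print 'char:', repr(unichr(name2codepoint[name]))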
        
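If your HTML may be invalid, you should add some exception handling:
int() and unichr() raise ValueError on a bad or out-of-range number, and
the name2codepoint lookup raises KeyError on an unknown entity name.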
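
As a rough sketch of how the pieces could fit together, including the
exception handling: the version below assumes Python 3 (html.parser and
html.entities instead of HTMLParser and htmlentitydefs, chr() instead of
unichr()). The names TextCollector and decode_text are made up for
illustration, and leaving a bad reference in the output as raw text is
just one possible policy. The handlers append to a shared list, so the
text before and after a reference ends up in one string.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class TextCollector(HTMLParser):
    """Hypothetical helper: joins text and decoded references."""

    def __init__(self):
        # convert_charrefs=False so that handle_charref and
        # handle_entityref are actually called in Python 3.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        try:
            if name[:1] in ('x', 'X'):
                num = int(name[1:], 16)
            else:
                num = int(name, 10)
            self.parts.append(chr(num))
        except (ValueError, OverflowError):
            # Bad or out-of-range number: keep the reference as raw text.
            self.parts.append('&#' + name + ';')

    def handle_entityref(self, name):
        try:
            self.parts.append(chr(name2codepoint[name]))
        except KeyError:
            # Unknown entity name: keep the reference as raw text.
            self.parts.append('&' + name + ';')


def decode_text(html):
    """Feed one document and return the collected text."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)
```

With this, decode_text('ab&#xE9;cd') gives one string with the é between
"ab" and "cd", because all three handler calls write to the same list.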
-- 
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
-- 
http://mail.python.org/mailman/listinfo/python-list
