>>>>> Fafounet <fafou...@gmail.com> (F) wrote:

>F> Hello,

>F> I am parsing a web page with special chars such as &#xe9; (which
>F> stands for é).
>F> I know I can have the unicode character é from unicode
>F> ("\xe9","iso-8859-1")
>F> but with those extra characters I don't know.
>F> I tried to implement handle_charref within HTMLParser without success.
>F> Furthermore, if I have the data ab&#xe9;cd, handle_data will get "ab",
>F> handle_charref will get xe9 and then handle_data doesn't have the end
>F> of the string ("cd").

The character references indicate Unicode ordinals, not iso-8859-1
characters. In your example it will give the proper character because
iso-8859-1 coincides with the first 256 Unicode ordinals, but for
characters outside of iso-8859-1 it will fail.

This should give you an idea:

    from htmlentitydefs import name2codepoint

    ...
        def handle_charref(self, name):
            # name is the text between '&#' and ';', e.g. 'xe9' or '233'
            if name.startswith(('x', 'X')):
                num = int(name[1:], 16)
            else:
                num = int(name, 10)
            print 'char:', repr(unichr(num))

        def handle_entityref(self, name):
            # name is the text between '&' and ';', e.g. 'eacute'
            print 'char:', repr(unichr(name2codepoint[name]))

If your HTML may be illegal you should add some exception handling: an
unknown entity name raises KeyError and a malformed number raises
ValueError.
--
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
--
http://mail.python.org/mailman/listinfo/python-list
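
For readers on Python 3, here is a minimal sketch of the same idea that also reassembles the pieces, so that ab&#xe9;cd comes back as one string. It is not part of the original reply. The names TextCollector and decode_html_text are made up for illustration, and leaving unknown or malformed references in the output untouched is just one possible policy. Note that Python 3's HTMLParser only calls handle_charref and handle_entityref when it is created with convert_charrefs=False; with the default setting it decodes references itself and passes the result to handle_data.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class TextCollector(HTMLParser):
    """Hypothetical helper: joins plain text and decoded references."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name is e.g. 'xe9' (hexadecimal) or '233' (decimal).
        try:
            if name[:1] in ('x', 'X'):
                num = int(name[1:], 16)
            else:
                num = int(name, 10)
            self.parts.append(chr(num))
        except (ValueError, OverflowError):
            # Assumed policy: keep a malformed reference as literal text.
            self.parts.append('&#' + name + ';')

    def handle_entityref(self, name):
        # name is e.g. 'eacute'.
        if name in name2codepoint:
            self.parts.append(chr(name2codepoint[name]))
        else:
            # Assumed policy: keep an unknown entity as literal text.
            self.parts.append('&' + name + ';')

    def text(self):
        return ''.join(self.parts)


def decode_html_text(markup):
    """Return the text content of markup with references decoded."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Calling decode_html_text("ab&#xe9;cd") collects "ab", then the decoded character, then "cd", and joins them in order, which addresses the concern about the trailing "cd" in the question.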