Re: Special chars with HTMLParser

2009-08-07 Thread Stefan Behnel
Fafounet wrote:
> I am parsing a web page with special chars such as &#xe9; (which
> stands for é).
> I know I can have the unicode character é from unicode
> ("\xe9","iso-8859-1")
> but with those extra characters I don't know.
>
> I tried to implement handle_charref within HTMLParser without success

Re: Special chars with HTMLParser

2009-08-05 Thread Piet van Oostrum
> Fafounet (F) wrote:
>F> Thank you, now I can get the correct character.
>F> Now when I have the string ab&#xe9;cd I can get ab then é thanks to
>F> your function and then cd. But how is it possible to know that cd is
>F> still the same word?
That depends on your definition of `word'. And that
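One way to keep "ab", "é" and "cd" together is to buffer every piece the parser reports and split into words only at the end. The following is a minimal sketch, not the code from this thread: it assumes Python 3's `html.parser` (the thread itself predates it and uses the Python 2 `HTMLParser` module), the names `TextCollector` and `words_in` are made up for illustration, and "word" is assumed to mean a whitespace-delimited run of characters.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text and decoded character references."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref
        # instead of silently decoding the references itself.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Plain text between tags and references, e.g. "ab" and "cd ef".
        self.parts.append(data)

    def handle_charref(self, name):
        # name is "xe9" for &#xe9; and "233" for &#233;.
        if name[:1] in ("x", "X"):
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))

    def text(self):
        # Joining with no separator keeps "ab" + "é" + "cd" as one run.
        return "".join(self.parts)


def words_in(html):
    """Return whitespace-delimited words (an assumed definition of 'word')."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text().split()


print(words_in("<p>ab&#xe9;cd ef</p>"))
```

Because the pieces are joined without a separator, text split by inline tags (for example `ab<b>cd</b>`) would also end up in one word under this sketch; whether that is wanted depends, again, on the definition of word.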

Re: Special chars with HTMLParser

2009-08-05 Thread Fafounet
Thank you, now I can get the correct character. Now when I have the string ab&#xe9;cd I can get ab then é thanks to your function and then cd. But how is it possible to know that cd is still the same word?
Fabien
> The character references indicate Unicode ordinals, not iso-8859-1
> characters. In
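To illustrate the quoted point that a character reference names a Unicode ordinal, here is a small sketch of the conversion. It assumes Python 3, where `chr` covers all of Unicode (in the Python 2 of this thread the counterpart would be `unichr`), and the function name `charref_to_text` is hypothetical, not the function posted in the thread.

```python
def charref_to_text(name):
    """Decode the body of a numeric character reference.

    `name` is the string HTMLParser.handle_charref receives:
    "233" for &#233; or "xe9" for &#xe9;.
    """
    if name[:1] in ("x", "X"):
        # Hexadecimal form: drop the leading "x" and parse base 16.
        return chr(int(name[1:], 16))
    # Decimal form.
    return chr(int(name, 10))


print(charref_to_text("xe9"))
print(charref_to_text("233"))
# Ordinal 8364 is the euro sign, which has no iso-8859-1 encoding,
# so decoding a byte as iso-8859-1 could never produce it.
print(charref_to_text("8364"))
```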

Re: Special chars with HTMLParser

2009-08-05 Thread Piet van Oostrum
> Fafounet (F) wrote:
>F> Hello,
>F> I am parsing a web page with special chars such as &#xe9; (which
>F> stands for é).
>F> I know I can have the unicode character é from unicode
>F> ("\xe9","iso-8859-1")
>F> but with those extra characters I don't know.
>F> I tried to implement handle_charref

Special chars with HTMLParser

2009-08-05 Thread Fafounet
Hello, I am parsing a web page with special chars such as &#xe9; (which stands for é). I know I can have the unicode character é from unicode("\xe9","iso-8859-1") but with those extra characters I don't know. I tried to implement handle_charref within HTMLParser without success. Furthermore, if I ha