Re: Special chars with HTMLParser

Piet van Oostrum Wed, 05 Aug 2009 11:26:28 -0700

>>>>> Fafounet <[email protected]> (F) wrote:

>F> Thank you, now I can get the correct character.
>F> Now when I have the string ab&#xE9;cd I can get ab then é thanks to
>F> your function and then cd. But how is it possible to know that cd is
>F> still the same word ?


That depends on your definition of `word', which in turn is
language-dependent.

What you normally do is collect the text in a (unicode) string variable,
appending to it in handle_data, handle_charref and handle_entityref.
Each time new text arrives, you check whether the text collected so far
ends in a word (e.g. consists of Unicode letters) and whether the new
text also starts with letters. If both hold, the new text continues the
same word. If your language has additional word constituents like - or '
you have to allow for those as well.
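
A minimal sketch of the collecting step, written for Python 3
(html.parser) rather than the Python 2 module this thread is about. The
names TextCollector and collect_text are made up for illustration, and
convert_charrefs=False is passed only so that the three handlers are
called separately, as they are in Python 2:

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every text piece into one string."""

    def __init__(self):
        # Assumption: disable automatic charref conversion so that
        # handle_charref and handle_entityref are actually called.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_data(self, data):
        # Plain text between tags and references.
        self.pieces.append(data)

    def handle_charref(self, name):
        # name is the part between "&#" and ";", e.g. "233" or "xE9".
        if name[0] in "xX":
            code = int(name[1:], 16)
        else:
            code = int(name)
        self.pieces.append(chr(code))

    def handle_entityref(self, name):
        # name is the part between "&" and ";", e.g. "eacute".
        code = name2codepoint.get(name)
        if code is None:
            # Unknown entity: keep it literally.
            self.pieces.append("&%s;" % name)
        else:
            self.pieces.append(chr(code))

    def text(self):
        return "".join(self.pieces)


def collect_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With this, collect_text("ab&#xE9;cd") gives one string in which ab, é
and cd are adjacent, so any word test can be applied to the whole text
instead of to the separate pieces.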

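You can do this check with unicodedata.category or with a regular
expression. In a regular expression \w may be helpful: with the
re.UNICODE flag it follows the Unicode character database, and with
re.LOCALE it follows your locale, which then has to be set correctly.
Note that \w also matches digits and the underscore.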
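
Both checks could look roughly like this in Python 3, where \w is
Unicode-aware for str patterns without any flag. The names EXTRA,
is_word_char, continues_word, WORD_RE and words are made up for
illustration, and treating - and ' as word constituents is only an
assumption about your language:

```python
import re
import unicodedata

# Assumed additional word constituents; adjust for your language.
EXTRA = "-'"


def is_word_char(ch):
    # Unicode general categories starting with "L" are letters.
    return unicodedata.category(ch).startswith("L") or ch in EXTRA


def continues_word(previous, new):
    """True if `new` continues the word at the end of `previous`."""
    if not previous or not new:
        return False
    return is_word_char(previous[-1]) and is_word_char(new[0])


# [^\W\d_] is \w minus digits and underscore, i.e. approximately
# "a letter". Hyphens and apostrophes are allowed inside a word only.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def words(text):
    return WORD_RE.findall(text)
```

For example, continues_word("ab", "é") and continues_word("é", "cd")
are both true, so ab, é and cd belong to one word, while
continues_word("ab ", "cd") is false because of the space.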
-- 
Piet van Oostrum <[email protected]>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: [email protected]
-- 
http://mail.python.org/mailman/listinfo/python-list
