>>>>> Fafounet <fafou...@gmail.com> (F) wrote:

>F> Thank you, now I can get the correct character.
>F> Now when I have the string ab&#xE9;cd I can get ab then é thanks to
>F> your function and then cd. But how is it possible to know that cd is
>F> still the same word ?

That depends on your definition of `word'. And that is
language-dependent. 

What you normally do is collect the text in a (unicode) string variable.
This happens in handle_data, handle_charref and handle_entityref.
Then you check that the previously collected stuff was a word (e.g.
consisting of Unicode letters), and that the new stuff also consists of
letters. If your language has additional word constituents like - or '
you have to add this.

You can do this with unicodedata.category or with a regular
expression. If your locale is correct \w in a regular expression may be
helpful. 
-- 
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to