>>>>> Fafounet <fafou...@gmail.com> (F) wrote:

>F> Thank you, now I can get the correct character.
>F> Now when I have the string abécd I can get ab then é thanks to
>F> your function and then cd. But how is it possible to know that cd is
>F> still the same word ?

That depends on your definition of `word', and that definition is language-dependent.

What you normally do is collect the text in a (unicode) string variable. This happens in handle_data, handle_charref and handle_entityref. Each time new text arrives, you check whether the previously collected text was a word (e.g. consisting of Unicode letters) and whether the new text also consists of letters; if both hold, the new text continues the same word. If your language has additional word constituents like - or ', you have to allow for these as well. You can do the check with unicodedata.category or with a regular expression. If your locale is correct, \w in a regular expression may be helpful.

--
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
--
http://mail.python.org/mailman/listinfo/python-list
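
A minimal sketch of the approach described in the reply, for illustration only. It assumes Python 3 (the thread predates it, so `chr` and `html.parser` stand in for the older `unichr` and `HTMLParser` module), and it assumes that a word is a run of Unicode letters plus `-` and `'`. The names `WordCollector`, `is_word_char` and `extract_words` are hypothetical and are not from the original post. Note that `convert_charrefs=False` is passed so that `handle_charref` and `handle_entityref` are actually called.

```python
import unicodedata
from html.entities import name2codepoint
from html.parser import HTMLParser


def is_word_char(ch, extra="-'"):
    """True for Unicode letters (category L*) and the assumed extra constituents."""
    return unicodedata.category(ch).startswith("L") or ch in extra


class WordCollector(HTMLParser):
    """Hypothetical helper: joins data, charrefs and entityrefs into words."""

    def __init__(self):
        # With convert_charrefs=True (the default) the two ref handlers
        # below would never be called.
        super().__init__(convert_charrefs=False)
        self.words = []
        self._current = []  # characters of the word being built

    def _feed_text(self, text):
        # A word continues across handler calls as long as letters keep coming.
        for ch in text:
            if is_word_char(ch):
                self._current.append(ch)
            else:
                self._flush()

    def _flush(self):
        if self._current:
            self.words.append("".join(self._current))
            self._current = []

    def handle_data(self, data):
        self._feed_text(data)

    def handle_charref(self, name):
        # name is e.g. "233" or "xE9"
        code = int(name[1:], 16) if name[0] in "xX" else int(name)
        self._feed_text(chr(code))

    def handle_entityref(self, name):
        code = name2codepoint.get(name)
        if code is None:
            self._flush()  # unknown entity: treated here as a word boundary
        else:
            self._feed_text(chr(code))

    def close(self):
        super().close()
        self._flush()


def extract_words(html_text):
    parser = WordCollector()
    parser.feed(html_text)
    parser.close()
    return parser.words


print(extract_words("ab&eacute;cd ef&#233;g"))
```

In this sketch tags do not end a word, so `ab<b>c</b>d` is collected as one word; whether that is right again depends on your definition of a word. A regular expression such as `\w+` is an alternative to `unicodedata.category`, but `\w` also matches digits and the underscore.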