John Machin wrote:
John, nothing I wrote was directed at you. If you feel insulted, you
have my apology. My intention was and is to get movement on an issue
that was reported 20 months ago but has lain dead since, because of a
misunderstanding by the person who (I believe) rewrote re for unicode
several years ago, until it was re-reported (a bit more clearly) a
week ago.
Like this:
| >>> w1 = u"L\N{LATIN SMALL LETTER O WITH DIAERESIS}wis"
| >>> w2 = u"Lo\N{COMBINING DIAERESIS}wis"
| >>> w1
| u'L\xf6wis'
| >>> w2
| u'Lo\u0308wis'
| >>> import re
| >>> import unicodedata as ucd
| >>> ucd.category(u'\u0308')
| 'Mn'
| >>> u'\u0308'.isalpha()
| False
| >>> regex = re.compile(ur'\w+', re.UNICODE)
| >>> regex.match(w1).group(0)
| u'L\xf6wis'
| >>> regex.match(w2).group(0)
| u'Lo'
Yes, thank you. FWIW, that confirms my suspicion.
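For anyone hitting this in the meantime, here is a minimal sketch of one
possible workaround, not a fix for re itself. It assumes Python 3 (where
str is unicode and \w is Unicode-aware by default) and that NFC
normalization is acceptable for your data. The helper name first_word is
made up for illustration.

```python
import re
import unicodedata

# Same two spellings as in the session above: precomposed vs. decomposed.
w1 = "L\N{LATIN SMALL LETTER O WITH DIAERESIS}wis"
w2 = "Lo\N{COMBINING DIAERESIS}wis"

word = re.compile(r"\w+")

def first_word(text):
    # Hypothetical helper: compose base letter + combining mark (NFC)
    # before matching, so \w sees a single alphabetic code point.
    m = word.match(unicodedata.normalize("NFC", text))
    return m.group(0) if m else None

raw = word.match(w2).group(0)   # stops at the combining mark
fixed = first_word(w2)          # matches the whole word after NFC
```

This only helps where a precomposed character exists. A base letter plus
a combining mark with no composed form will still be split by \w.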
Terry