I don't think Python's regular expression module handles combining marks correctly; it gives inconsistent results for canonically equivalent forms of the same string:

>>> import re, sys, unicodedata
>>> sys.version
'2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]'
>>> re.match('\w', unicodedata.normalize('NFD', u'\xf1'), re.UNICODE).group(0)
u'n'
>>> re.match('\w', unicodedata.normalize('NFC', u'\xf1'), re.UNICODE).group(0)
u'\xf1'

In the above example, u'\xf1' is n-with-tilde (ñ). NFC happens to be a no-op here, and NFD decomposes it into u'n\u0303', which splits out the tilde as a combining mark. So \w matches the whole character in the NFC form but only the base letter in the NFD form.

Is this a limitation by design, or a bug? If the latter, is it already known and scheduled to be fixed?
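
For what it's worth, here is a minimal sketch of one way to work around the mismatch in user code, assuming the goal is to treat a base letter plus its trailing combining marks as a single unit. The helper name match_word_with_marks is hypothetical and not part of the re module; it relies only on re.match and unicodedata.combining. I wrote it against the behaviour shown above and would expect it to behave the same on later Python versions.

```python
import re
import unicodedata

def match_word_with_marks(text):
    # Hypothetical helper (not part of re): match one word character at the
    # start of text, then extend the match over any combining marks after it.
    m = re.match(r'\w', text, re.UNICODE)
    if m is None:
        return None
    end = m.end()
    # unicodedata.combining() returns a nonzero canonical combining class
    # for combining marks such as U+0303 COMBINING TILDE.
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[:end]

nfd = unicodedata.normalize('NFD', u'\xf1')   # decomposed: u'n\u0303'
nfc = unicodedata.normalize('NFC', u'\xf1')   # precomposed: u'\xf1'

plain_nfd = re.match(r'\w', nfd, re.UNICODE).group(0)   # base letter only
extended_nfd = match_word_with_marks(nfd)               # base letter plus tilde
extended_nfc = match_word_with_marks(nfc)               # precomposed character
```

With this, the NFD and NFC inputs each yield the complete character in their own form. Normalizing the input to NFC before matching is another option, though it only helps where a precomposed character exists.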