Unicode regular expressions -- buggy?

Christopher Subich Wed, 10 Aug 2005 23:55:38 -0700

I don't think the Python regular expression module (re) correctly handles 
combining marks; it gives inconsistent results when matching canonically 
equivalent forms of the same string:


>>> import re, sys, unicodedata
>>> sys.version
'2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]'
>>> re.match('\w',unicodedata.normalize('NFD',u'\xf1'),re.UNICODE).group(0)
u'n'
>>> re.match('\w',unicodedata.normalize('NFC',u'\xf1'),re.UNICODE).group(0)
u'\xf1'

In the above example, u'\xf1' is n-with-tilde (ñ).  NFC happens to be a 
no-op, and NFD decomposes it into u'n\u0303', which splits out the tilde 
as a combining mark.
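
For reference, here is a minimal sketch of the same check. It assumes a 
Python 3 interpreter rather than the 2.4.1 session above; there, str 
patterns are Unicode-aware by default, so the re.UNICODE flag is not 
needed. The variable names are only illustrative. It shows that the two 
forms differ in length and that the second code point of the decomposed 
form is a nonspacing mark, which \w does not treat as a word character:

```python
import re
import unicodedata

# Assumption: Python 3, where '\u00f1' is already a Unicode string.
composed = unicodedata.normalize('NFC', '\u00f1')    # single code point
decomposed = unicodedata.normalize('NFD', '\u00f1')  # 'n' + U+0303

print(len(composed), len(decomposed))                # 1 2
print(unicodedata.category(decomposed[1]))           # Mn (nonspacing mark)

# \w consumes the whole precomposed character, but only the base
# letter of the decomposed form; the combining tilde is left behind.
print(ascii(re.match(r'\w', composed).group(0)))     # '\xf1'
print(ascii(re.match(r'\w', decomposed).group(0)))   # 'n'
```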

Is this a limitation-by-design, or a bug?  If the latter, is it already 
known/to-be-fixed?