New submission from Steve Moran <s...@uw.edu>:

The regex package doesn't seem to implement the single-grapheme match "\X" (\P{M}\p{M}*) correctly on Python versions before 3. I'm using the string "íi-te" (i, U+0301, i, -, t, e, where U+0301 is the Unicode COMBINING ACUTE ACCENT), read in from a file to bypass Unicode copy-and-paste issues in the older IDLEs.
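For reference, the \P{M}\p{M}* grouping quoted above can be approximated with the standard library alone. This is only a sketch of the expected result, not how the regex package implements \X; the helper name simple_graphemes is made up for illustration, and the code assumes Python 3 string semantics.

```python
import unicodedata


def simple_graphemes(text):
    # Group each non-mark character with the combining marks that follow it,
    # approximating the pattern \P{M}\p{M}* (hypothetical helper, not part of
    # the regex package).
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            # Attach the mark to the preceding cluster.
            clusters[-1] += ch
        else:
            # A base character starts a new cluster. A mark at the very start
            # of the text also lands here; the pattern itself would skip it.
            clusters.append(ch)
    return clusters


s = "i\u0301i-te"
print(simple_graphemes(s))
```

With the test string from this report, the first cluster should hold both the "i" and the U+0301 that follows it, which is the grouping the Python 3.1 session shows.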
s...@x$ python3.1
Python 3.1.2 (r312:79147, May 19 2010, 11:50:28)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> file = open("test_data", "rt", encoding="utf-8")
>>> s = file.readline()
>>> print (s)
íi-te
>>> print (g.findall(s))
['í', 'i', '-', 't', 'e']

* Correct in 3.1 - i+U+0301 considered one grapheme.

s...@x$ python2.7
Python 2.7 (r27:82500, Oct 4 2010, 14:49:53)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs
>>> import regex
>>> file = codecs.open("test_data", "r", "utf-8")
>>> g = regex.compile("\X")
>>> s = file.readline()
>>> s
u'i\u0301i-te'
>>> print s.encode("utf-8")
íi-te
>>> print g.findall(s)
[u'i', u'\u0301', u'i', u'-', u't', u'e']

* Not correct -- accent is treated as a separate character.

Thanks.

----------
components: Regular Expressions
messages: 123961
nosy: stiv
priority: normal
severity: normal
status: open
title: Regex 0.1.20101210
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue10703>
_______________________________________