New submission from Tom Christiansen <tchr...@perl.com>: Python is in flagrant violation of the very most basic premises of Unicode Technical Report #18 on Regular Expressions, which requires that a regex engine support Unicode characters as "basic logical units independent of serialization like UTFβ*". Because sometimes you must specify ".." to match a single Unicode character -- whenever those code points are above the BMP and you are on a narrow build -- Python regexes cannot be reliably used for Unicode text.
% python3.2 Python 3.2 (r32:88445, Jul 21 2011, 14:44:19) [GCC 4.2.1 (Apple Inc. build 5664)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import re >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}" >>> print(g) αΎ² >>> print(re.search(r'\w', g)) <_sre.SRE_Match object at 0x10051f988> >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}" >>> print(p) π« >>> print(re.search(r'\w', p)) None >>> print(re.search(r'..', p)) # β ππππΒ ππΒ πππΒ πππππΌππππΒ πππππΒ ππππ <_sre.SRE_Match object at 0x10051f988> >>> print(len(chr(0x1D4AB))) 2 That is illegal in Unicode regular expressions. ---------- components: Regular Expressions messages: 141917 nosy: tchrist priority: normal severity: normal status: open title: Python lib re cannot handle Unicode properly due to narrow/wide bug type: behavior versions: Python 2.7 _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue12729> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com