New submission from Tom Christiansen <tchr...@perl.com>:

The Python re library is broken in its approach to case-insensitive matches. It erroneously attempts to compare lowercase mappings. This is wrong. You must compare the Unicode casefolds, not the Unicode casemaps. Otherwise you get wrong answers.

I include a small test case that illustrates this bug. The bug exists on both 2.7 and 3.2, and on both wide builds and narrow builds. For comparison, I also show results using Matthew Barnett's regex library, which gets all 5 tests correct where re gets all 5 tests wrong.
A sample run is:

FAIL: re pattern Ι is not the same as string ͅ
PASS: regex pattern Ι is indeed the same as string ͅ
FAIL: re pattern Μ is not the same as string µ
PASS: regex pattern Μ is indeed the same as string µ
FAIL: re pattern ſ is not the same as string s
PASS: regex pattern ſ is indeed the same as string s
FAIL: re pattern ΣΤΙΓΜΑΣ is not the same as string στιγμας
PASS: regex pattern ΣΤΙΓΜΑΣ is indeed the same as string στιγμας
FAIL: re pattern POST is not the same as string poſt
PASS: regex pattern POST is indeed the same as string poſt

re lib passed 0 of 5 tests
regex lib passed 5 of 5 tests

----------
components: Library (Lib)
files: sigmata.python
messages: 141916
nosy: tchrist
priority: normal
severity: normal
status: open
title: Python re lib fails case insensitive matches on Unicode data
versions: Python 2.7

Added file: http://bugs.python.org/file22879/sigmata.python

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12728>
_______________________________________
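
The casefold-versus-lowercase distinction raised in the report can be sketched with plain string methods. This is an illustration only: it is not the attached sigmata.python, and it does not show how re or regex work internally. It assumes Python 3.3 or later, where str.casefold() is available (newer than the 2.7 and 3.2 versions named in the report). The names PAIRS, caseless_equal and lowercase_equal are hypothetical helpers introduced for this example; the five pairs are the ones listed in the sample run.

```python
# Illustrative sketch only; not the re module's implementation and not the
# attached test file. Assumes Python 3.3+ for str.casefold().

# (pattern, string) pairs taken from the sample run in the report.
PAIRS = [
    ("\u0399", "\u0345"),       # GREEK CAPITAL LETTER IOTA vs COMBINING GREEK YPOGEGRAMMENI
    ("\u039c", "\u00b5"),       # GREEK CAPITAL LETTER MU vs MICRO SIGN
    ("\u017f", "s"),            # LATIN SMALL LETTER LONG S vs LATIN SMALL LETTER S
    ("ΣΤΙΓΜΑΣ", "στιγμας"),     # capital sigmas vs medial and final small sigma
    ("POST", "po\u017ft"),      # ASCII vs a word containing long s
]


def caseless_equal(a, b):
    """Compare two strings by their Unicode case folds."""
    return a.casefold() == b.casefold()


def lowercase_equal(a, b):
    """Compare two strings by their lowercase mappings (the approach criticized above)."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    for pattern, text in PAIRS:
        print(
            ascii(pattern), ascii(text),
            "casefold:", caseless_equal(pattern, text),
            "lower:", lowercase_equal(pattern, text),
        )
```

Every pair compares equal under caseless_equal, while pairs such as long s against s, or capital mu against the micro sign, do not compare equal under lowercase_equal, because those characters have no lowercase mapping that brings them together. Case folding exists precisely to give such characters a common form for comparison.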