New submission from Tom Christiansen <tchr...@perl.com>:

The Python re library is broken in its approach to case-insensitive matches. It
erroneously compares lowercase mappings. This is wrong: you must compare the
Unicode casefolds, not the Unicode casemaps, or you get wrong answers. I include
a small test case that illustrates this bug. The bug exists on both Python 2.7
and 3.2, and on both wide builds and narrow builds. For comparison, I also show
results using Matthew Barnett's regex library, which gets all 5 tests correct
where re gets all 5 tests wrong.
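
To make the casemap/casefold distinction concrete, here is a minimal sketch that compares whole strings both ways, using the same five pairs as the sample run. It is not the attached sigmata.python and says nothing about how re or regex work internally. It assumes Python 3.3 or later, where str.casefold() is available (it is not in the 2.7 and 3.2 versions discussed here); the helper names lower_equal and caseless_equal are made up for illustration.

```python
# Sketch only: assumes Python 3.3+ for str.casefold().
# Helper names are hypothetical; this compares whole strings, not regex matches.

PAIRS = [
    ("\u0399", "\u0345"),   # GREEK CAPITAL LETTER IOTA vs COMBINING GREEK YPOGEGRAMMENI
    ("\u039c", "\u00b5"),   # GREEK CAPITAL LETTER MU vs MICRO SIGN
    ("\u017f", "s"),        # LATIN SMALL LETTER LONG S vs LATIN SMALL LETTER S
    ("\u03a3\u03a4\u0399\u0393\u039c\u0391\u03a3",    # capital sigma ... capital sigma
     "\u03c3\u03c4\u03b9\u03b3\u03bc\u03b1\u03c2"),   # small sigma ... final sigma
    ("POST", "po\u017ft"),  # second string contains LATIN SMALL LETTER LONG S
]


def lower_equal(a, b):
    """Compare lowercase mappings (the approach the report calls wrong)."""
    return a.lower() == b.lower()


def caseless_equal(a, b):
    """Compare Unicode casefolds (the approach the report asks for)."""
    return a.casefold() == b.casefold()


for a, b in PAIRS:
    # ascii() keeps the output readable on terminals without Unicode support.
    print(ascii(a), ascii(b),
          "lower:", lower_equal(a, b),
          "casefold:", caseless_equal(a, b))
```

In this sketch all five pairs compare equal under casefold, while pairs such as long s versus s do not compare equal under lowercasing, because lowercasing leaves U+017F unchanged and casefolding maps it to "s".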

A sample run is:

FAIL: re    pattern Ι is    not the same as string ͅ
PASS: regex pattern Ι is indeed the same as string ͅ
FAIL: re    pattern Μ is    not the same as string µ
PASS: regex pattern Μ is indeed the same as string µ
FAIL: re    pattern ſ is    not the same as string s
PASS: regex pattern ſ is indeed the same as string s
FAIL: re    pattern ΣΤΙΓΜΑΣ is    not the same as string στιγμας
PASS: regex pattern ΣΤΙΓΜΑΣ is indeed the same as string στιγμας
FAIL: re    pattern POST is    not the same as string poſt
PASS: regex pattern POST is indeed the same as string poſt

re    lib passed 0 of 5 tests
regex lib passed 5 of 5 tests

----------
components: Library (Lib)
files: sigmata.python
messages: 141916
nosy: tchrist
priority: normal
severity: normal
status: open
title: Python re lib fails case insensitive matches on Unicode data
versions: Python 2.7
Added file: http://bugs.python.org/file22879/sigmata.python

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12728>
_______________________________________
