New submission from Serhiy Storchaka:

The current implementation of re.LOCALE support for Unicode strings is 
nonsensical. It works correctly only in Latin1 locales, because a Unicode 
string is interpreted as a bytes string decoded from Latin1 and all characters 
outside the UCS1 range are treated as non-word characters. In other locales it 
gives strange and useless results.

>>> import re, locale
>>> locale.setlocale(locale.LC_CTYPE, 'ru_RU.cp1251')
'ru_RU.cp1251'
>>> re.match(br'\w', 'µ'.encode('cp1251'), re.L)
<_sre.SRE_Match object; span=(0, 1), match=b'\xb5'>
>>> re.match(r'\w', 'µ', re.L)
<_sre.SRE_Match object; span=(0, 1), match='µ'>
>>> re.match(br'\w', 'ё'.encode('cp1251'), re.L)
<_sre.SRE_Match object; span=(0, 1), match=b'\xb8'>
>>> re.match(r'\w', 'ё', re.L)
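
The difference between the two str results above can be traced to code points. The following is a minimal sketch that only illustrates the explanation given earlier; it does not touch the re internals, and describe() is a hypothetical helper introduced here for illustration.

```python
def describe(ch):
    """Return (code point, cp1251 byte value, fits in the Latin1 range?)."""
    code_point = ord(ch)
    cp1251_byte = ch.encode('cp1251')[0]
    # Only code points below 0x100 can be looked up as a single byte.
    return code_point, cp1251_byte, code_point < 0x100

for ch in 'µё':
    code_point, cp1251_byte, fits = describe(ch)
    print(f'{ch!a}: U+{code_point:04X}, cp1251 byte 0x{cp1251_byte:02X}, '
          f'in Latin1 range: {fits}')
```

For 'µ' the code point and the cp1251 byte happen to coincide (both 0xB5), so the str match succeeds by accident. 'ё' is U+0451, outside the UCS1 range, so it is never classified by the locale even though its cp1251 byte 0xB8 is a word character there.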

The proposed patch fixes re.LOCALE support for Unicode strings. It uses the 
wide-character equivalents of the C character functions (towlower(), 
iswalpha(), etc).
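
To see how these wide-character functions behave on a given platform, they can be called directly through ctypes. This is a rough sketch, not the patch itself. It assumes a POSIX system where ctypes.CDLL(None) exposes the C library, and it assumes wint_t can be passed as an unsigned int. is_alpha() and to_lower() are hypothetical wrappers named here for illustration.

```python
import ctypes

# Assumption: POSIX, where CDLL(None) gives access to the C library symbols.
libc = ctypes.CDLL(None)

# Assumption: wint_t is passed as an unsigned int on this platform.
libc.iswalpha.argtypes = [ctypes.c_uint]
libc.iswalpha.restype = ctypes.c_int
libc.towlower.argtypes = [ctypes.c_uint]
libc.towlower.restype = ctypes.c_uint

def is_alpha(ch):
    """Classify one character with the C library's iswalpha()."""
    return libc.iswalpha(ord(ch)) != 0

def to_lower(ch):
    """Lowercase one character with the C library's towlower()."""
    return chr(libc.towlower(ord(ch)))

print(is_alpha('a'), is_alpha('1'), to_lower('A'))
```

Results for non-ASCII characters depend on the LC_CTYPE locale in effect, which can be changed with locale.setlocale() before calling the wrappers.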

The problem is that these functions do not exist in C89; they were introduced 
only in C99. GCC understands them, but we should check other compilers. 
However, these functions are already used on FreeBSD and MacOS.

----------
components: Extension Modules, Library (Lib), Regular Expressions
files: re_unicode_locale.patch
keywords: patch
messages: 226871
nosy: ezio.melotti, mrabarnett, pitrou, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: re.LOCALE is nonsensical for Unicode
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5
Added file: http://bugs.python.org/file36615/re_unicode_locale.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22407>
_______________________________________