Re: regular expressions and the LOCALE flag

2010-08-03 Thread MRAB
Baz Walter wrote: On 03/08/10 21:24, MRAB wrote: And, BTW, none of your examples pass a UTF-8 bytestring to re.findall: all those string literals starting with the 'u' prefix are Unicode strings! not sure what you mean by this: if the string was encoded as utf8, '\w' s

Re: regular expressions and the LOCALE flag

2010-08-03 Thread Baz Walter
On 03/08/10 21:24, MRAB wrote: And, BTW, none of your examples pass a UTF-8 bytestring to re.findall: all those string literals starting with the 'u' prefix are Unicode strings! not sure what you mean by this: if the string was encoded as utf8, '\w' still wouldn't match any of the non-ascii cha
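For anyone reading this thread later, the Unicode-string versus UTF-8-bytestring distinction can be checked with a small sketch. It assumes Python 3, where plain string literals are Unicode and `\w` semantics differ from the Python 2 interpreter used in the thread; the variable names are illustrative only.

```python
import re

text = 'a b c \xe5 \xe6 \xe7'        # a Unicode (str) value
utf8_bytes = text.encode('utf-8')    # the same text as a UTF-8 bytestring

# str pattern on str input: \w is Unicode-aware by default in Python 3.
unicode_matches = re.findall(r'\w', text)

# bytes pattern on UTF-8 bytes: \w is ASCII-only by default, so the
# two-byte sequences for the accented letters are not matched.
bytes_matches = re.findall(rb'\w', utf8_bytes)

# re.ASCII restricts a str pattern to ASCII word characters.
ascii_matches = re.findall(r'\w', text, re.ASCII)

print(unicode_matches)
print(bytes_matches)
print(ascii_matches)
```

Under these assumptions, only the str pattern without re.ASCII picks up the non-ASCII letters, which is consistent with the point that `\w` does not match them in a UTF-8 encoded bytestring.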

Re: regular expressions and the LOCALE flag

2010-08-03 Thread MRAB
Baz Walter wrote: On 03/08/10 19:40, MRAB wrote: Baz Walter wrote: the python docs say that re.LOCALE makes certain character classes "dependent on the current locale". re.LOCALE just passes the character to the underlying C library. It really only works on bytestrings which have 1 byte per c
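The remark that re.LOCALE only makes sense for one-byte-per-character bytestrings can be illustrated with a rough sketch. It assumes Python 3.6 or later, where the flag is accepted only with bytes patterns; the sample bytes are an assumption (Latin-1 style data), and which high bytes match depends on the locale in effect when the code runs.

```python
import re

data = b'a b c \xe5 \xe6 \xe7'   # one byte per character (e.g. Latin-1 bytes)

# With re.LOCALE on a bytes pattern, \w follows the current LC_CTYPE
# locale. ASCII letters match in any locale; whether bytes 0xE5-0xE7
# match depends on the locale in effect.
locale_matches = re.findall(rb'\w', data, re.LOCALE)

# Python 3.6+ rejects re.LOCALE combined with a str pattern.
try:
    re.compile(r'\w', re.LOCALE)
    str_pattern_allowed = True
except ValueError:
    str_pattern_allowed = False

print(locale_matches)
print(str_pattern_allowed)
```

No expected output is given for the high bytes, since that part of the result is locale-dependent by design.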

Re: regular expressions and the LOCALE flag

2010-08-03 Thread Baz Walter
On 03/08/10 19:40, MRAB wrote: Baz Walter wrote: the python docs say that re.LOCALE makes certain character classes "dependent on the current locale". re.LOCALE just passes the character to the underlying C library. It really only works on bytestrings which have 1 byte per character. the re

Re: regular expressions and the LOCALE flag

2010-08-03 Thread MRAB
Baz Walter wrote: the python docs say that re.LOCALE makes certain character classes "dependent on the current locale". here's what i currently see on my system:

>>> import re, locale
>>> locale.getdefaultlocale()
('en_GB', 'UTF8')
>>> locale.getlocale()
(None, None)
>>> re.findall(r'\w',

regular expressions and the LOCALE flag

2010-08-03 Thread Baz Walter
the python docs say that re.LOCALE makes certain character classes "dependent on the current locale". here's what i currently see on my system:

>>> import re, locale
>>> locale.getdefaultlocale()
('en_GB', 'UTF8')
>>> locale.getlocale()
(None, None)
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7'