Re: schizophrenic view of what is white space

MRAB Thu, 04 Dec 2008 12:31:02 -0800

Terry Reedy wrote:

MRAB wrote:
Robin Becker wrote:
Jean-Paul Calderone wrote:
.........
You have to give the re module an additional hint that you care about
unicode:

 [EMAIL PROTECTED]:~$ python
 Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
 [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import re
 >>> print re.compile(r'\s').search(u'a\xa0b')
 None
 >>> print re.compile(r'\s', re.U).search(u'a\xa0b')
 <_sre.SRE_Match object at 0xb7dbb3a0>
 >>>

Jean-Paul
.......
so the default behaviour differs for unicode and re working on unicode. I suppose that won't be true in Python 3.
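
For readers on Python 3, here is a minimal sketch of how the default changed, assuming a Python 3 interpreter (the session above is Python 2.5). There, str patterns follow Unicode rules without any flag, and re.ASCII asks for the ASCII-only behaviour instead. The variable names are only illustrative.

```python
import re

text = 'a\xa0b'  # U+00A0 NO-BREAK SPACE between two letters

# In Python 3 a str pattern uses Unicode rules by default; no re.U needed.
unicode_match = re.compile(r'\s').search(text)

# re.ASCII restricts \s to ASCII whitespace, like the Python 2 default above.
ascii_match = re.compile(r'\s', re.ASCII).search(text)

print(unicode_match is not None)
print(ascii_match is not None)
```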
 >
I'm not sure why the Unicode flag is needed in the API. I reckon that it should just look at the text that the regular expression is being applied to: if it's Unicode then follow the Unicode rules, if not then don't.
I presume because \b is interpreted and replaced when the re is compiled into internal state machine form.

The regular expression is compiled to codes which are then interpreted. There are 2 versions of the matcher, one for bytestrings and another for Unicode. I don't think that having it agnostic is too difficult to achieve.
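
The "agnostic" idea can be approximated from user code. The sketch below is not how the re module works internally; AgnosticPattern is a hypothetical wrapper, written for Python 3 and assuming the pattern source is plain ASCII. It compiles the pattern twice and picks a matcher from the type of the text.

```python
import re


class AgnosticPattern:
    """Hypothetical wrapper: choose the matcher from the type of the text."""

    def __init__(self, pattern):
        # Assumption: `pattern` is an ASCII-only str, so it can be encoded
        # without loss to build the bytes version.
        self._str_re = re.compile(pattern)
        self._bytes_re = re.compile(pattern.encode('ascii'))

    def search(self, text):
        # Bytes text gets the bytes matcher (ASCII rules for \s, \w, ...);
        # str text gets the Unicode matcher.
        if isinstance(text, (bytes, bytearray)):
            return self._bytes_re.search(text)
        return self._str_re.search(text)


space = AgnosticPattern(r'\s')
found_in_str = space.search('a\xa0b')
found_in_bytes = space.search(b'a\xa0b')
print(found_in_str is not None, found_in_bytes is not None)
```

With this wrapper the same \s finds the no-break space in str text but not in the bytestring, which is the "look at the text" behaviour suggested above.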

Interestingly, it treats every bytestring character as just a Unicode codepoint, so re.match(chr(0x80), unichr(0x80)) succeeds! I suppose it should complain if only one of the regex and the text is Unicode and the regex contains a literal or a literal character set (if the regex is, say, just \w then it doesn't matter).
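
For comparison, Python 3 does complain about mixed types, and it does so for every pattern, not only ones containing literals. A small sketch, assuming a Python 3 interpreter; try_match is a hypothetical helper used only to capture the error.

```python
import re


def try_match(pattern, text):
    """Return the match object (or None), or the TypeError message if raised."""
    try:
        return re.match(pattern, text)
    except TypeError as exc:
        return str(exc)


# Bytes pattern against str text: Python 3 raises TypeError.
mixed = try_match(b'\x80', '\x80')

# Matching types on both sides works as usual.
same_bytes = try_match(b'\x80', b'\x80')
same_str = try_match('\x80', '\x80')

# Even a pattern with no literal, such as \w, is rejected when types differ.
mixed_class = try_match(b'\\w', 'a')

print(mixed)
```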

--
http://mail.python.org/mailman/listinfo/python-list
