MRAB <[EMAIL PROTECTED]> writes:

> I'm not sure why the Unicode flag is needed in the API. I reckon
> that it should just look at the text that the regular expression is
> being applied to: if it's Unicode then follow the Unicode rules, if
> not then don't.

It might be that using Unicode tables to look up character classes
slows things down considerably, because the tables are huge. It is
useful to be able to treat Unicode strings the same way ASCII strings
are treated, but the question is what the default should be.
Whitespace is probably not controversial, but many parsers expect
things like \d to match [0-9], not any Unicode character marked as a
"digit". For example, I'm not sure this behavior would be a good
default:

    >>> re.match(r'\d', u'\u0660', re.UNICODE)
    <_sre.SRE_Match object at 0xb7da0250>

What digit is \u0660, out of 0-9? Hard to say. If re.UNICODE were the
default for Unicode strings, code that expected \d to yield an actual
ASCII digit would have a problem on its hands, especially in Python 3,
where that would apply to *all* strings.
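
To make the concern concrete, here is a minimal sketch of how the same
match behaves under Python 3. It assumes a released Python 3
interpreter, where str patterns use Unicode character classes by
default and the re.ASCII flag is available; the variable names are
mine, for illustration only.

```python
import re
import unicodedata

ARABIC_ZERO = "\u0660"  # hypothetical name for the character under discussion

# With a str pattern and no flags, \d uses the Unicode definition of
# "decimal digit", so it matches characters outside 0-9.
unicode_match = re.match(r"\d", ARABIC_ZERO)

# re.ASCII restricts \d to [0-9] for str patterns.
ascii_match = re.match(r"\d", ARABIC_ZERO, re.ASCII)

# An explicit character class means [0-9] regardless of flags.
explicit_match = re.match(r"[0-9]", ARABIC_ZERO)

print(unicode_match is not None)   # True
print(ascii_match is not None)     # False
print(explicit_match is not None)  # False

# The character does carry a numeric value, which int() understands.
print(unicodedata.name(ARABIC_ZERO), int(ARABIC_ZERO))
```

So the value is recoverable (the character is ARABIC-INDIC DIGIT ZERO,
and int() maps it to 0), but a parser that assumes \d only ever yields
one of the ten ASCII characters would still be surprised. Writing
[0-9] explicitly, or passing re.ASCII, avoids depending on the default.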