Re: regexps with unicode-aware characterclasses?

Martin v. Löwis Tue, 13 Sep 2005 23:20:46 -0700

Stefan Rank wrote:
> <wishful thinking>
> 
>   re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))


This would (almost) work, but it would be terribly inefficient (matching
time is linear in the number of alternatives). A more realistic approach is

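import re, sys

# Build one character class containing every uppercase code point.
uppers = [u'[']
for i in range(sys.maxunicode + 1):  # + 1 so the last code point is included
  c = unichr(i)
  if c.isupper(): uppers.append(c)
uppers.append(u']')
uppers = u"".join(uppers)
uppers_re = re.compile(uppers)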

Compiling this expression is quite expensive; matching it is fairly
efficient (time independent of the number of characters in the class).
To save startup cost, consider pickling the compiled expression.

(Syntax note: this only works because none of the characters that are
special inside an RE character class (]-^\) is an uppercase letter;
otherwise, escaping might be needed.)
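
If escaping ever did become necessary, one possible way to guard against it
is sketched below. This is only an illustration, not part of the original
suggestion: it assumes Python 3 (where chr() replaces unichr() and str is
Unicode), the names build_upper_class and UPPER_RE are made up for this
example, and collapsing consecutive code points into ranges is an extra
step added here to keep the pattern short.

```python
import re
import sys


def build_upper_class():
    """Return a regex character class matching every uppercase code point."""
    ranges = []  # list of [first, last] code point pairs
    for i in range(sys.maxunicode + 1):
        if chr(i).isupper():
            if ranges and ranges[-1][1] == i - 1:
                ranges[-1][1] = i          # extend the current run
            else:
                ranges.append([i, i])      # start a new run
    parts = []
    for first, last in ranges:
        # re.escape protects any character that is special inside a class.
        if first == last:
            parts.append(re.escape(chr(first)))
        else:
            parts.append(re.escape(chr(first)) + "-" + re.escape(chr(last)))
    return "[" + "".join(parts) + "]"


UPPER_RE = re.compile(build_upper_class())
```

With this, UPPER_RE.findall("Hello Wörld") picks out only the uppercase
letters, and the characters ]-^\ themselves do not match.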

> for the latter two, to work on utf-8 strings, would I have to set the
> defaultencoding to utf-8?

For Unicode things, you should avoid using byte strings - especially
when it comes to regular expressions. Use Unicode strings instead.
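
As a rough illustration of why this matters, the sketch below decodes UTF-8
bytes first and only then applies the regular expression. It assumes
Python 3, where str is the Unicode string type and \w is Unicode-aware for
str patterns; the sample text and variable names are invented for the
example.

```python
import re

raw = "Ünïcödé Text".encode("utf-8")   # bytes, e.g. as read from a file

# Decode once at the boundary, then work with Unicode strings only.
text = raw.decode("utf-8")
words = re.findall(r"\w+", text)

# For comparison: a bytes pattern only knows ASCII word characters,
# so the multi-byte UTF-8 sequences split the words apart.
byte_words = re.findall(rb"\w+", raw)
```

Here words contains the two whole words, while byte_words contains only
the ASCII fragments between the encoded non-ASCII letters.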

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list
