Stefan Rank wrote:
> <wishful thinking>
>
> re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

This would (almost) work, but it would be terribly inefficient (matching time is linear in the number of alternatives). Realistically, you can do

    import re
    import sys

    # Build one character class containing every uppercase code point.
    uppers = [u'[']
    for i in range(sys.maxunicode + 1):
        c = unichr(i)
        if c.isupper():
            uppers.append(c)
    uppers.append(u']')
    uppers = u"".join(uppers)
    uppers_re = re.compile(uppers)

Compiling this expression is quite expensive; matching it is fairly efficient (time independent of the number of characters in the class). To save startup cost, consider pickling the compiled expression.

(Syntax note: this only works because none of the characters special to a RE class (]-^\) is an uppercase letter; otherwise, escaping might be needed.)

> for the latter two, to work on utf-8 strings, would I have to set the
> defaultencoding to utf-8?

For Unicode things, you should avoid using byte strings - especially when it comes to regular expressions. Use Unicode strings instead.

Regards,
Martin
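
P.S. For what it's worth, here is a rough sketch of the same idea wrapped up so that UTF-8 input is decoded before matching. It assumes Python 3 (where str is already Unicode and chr replaces unichr); the names build_upper_class, UPPER_RE and find_uppers are made up for illustration and are not part of any library.

```python
import re
import sys


def build_upper_class():
    """Return a regex character class listing every uppercase code point."""
    # Assumption: Python 3, so chr() covers the full Unicode range.
    chars = [chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]
    # No escaping is done here: none of the characters special inside a
    # class (] - ^ \) is an uppercase letter.
    return "[" + "".join(chars) + "]"


# Compiling is the expensive step, so do it once at import time.
UPPER_RE = re.compile(build_upper_class())


def find_uppers(text):
    """Return all uppercase characters found in a Unicode string."""
    return UPPER_RE.findall(text)


# UTF-8 bytes are decoded first; the pattern only ever sees Unicode text.
sample = b"\xc3\x84bc Xyz".decode("utf-8")
found = find_uppers(sample)  # U+00C4 and "X"
```

The point of decoding up front is that the pattern then never has to care which encoding the input arrived in.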