Re: [regex] case-splitting strings in unicode

Martin v. Löwis Sun, 09 Oct 2005 09:40:48 -0700

John Perks and Sarah Mount wrote:
> I have to split some identifiers that are casedLikeThis into their
> component words. In this instance I can safely use [A-Z] to represent
> uppercase, but what pattern should I use if I wanted it to work more
> generally? I can envisage walking the string testing the
> unicodedata.category of each char, but is there a regex'y way to denote
> "uppercase"?


Python's re module does not currently implement this, although it
should (it would be written as [[:upper:]], I believe). Contributions
are welcome; make sure you read the Unicode Consortium's guidelines on
regular expressions before attempting to implement it.

Until then, the "best" way is to use a regular character class,
precomputed or computed at runtime.

import sys

# sys.maxunicode is itself a valid code point, so the range must include it.
uni_upper = [unichr(i) for i in range(sys.maxunicode + 1)
             if unichr(i).isupper()]
uni_re = u"[" + u"".join(uni_upper) + u"]"

On my machine, this takes approximately one second to compute,
which may or may not be too much as a startup cost. To speed
this up, you could dump the resulting uni_re into a Python
source file.
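
To tie this back to the original question, here is a rough sketch of how
the computed class could be used to split an identifier, and how it could
be turned into source text for a precomputed module. It is written for
Python 3 (chr() instead of unichr()), and the names split_point,
split_cased and render_module are made up for illustration; treat it as
one possible approach rather than the definitive one.

```python
import re
import sys

# Python 3 spelling of the snippet above: chr() replaces unichr().
uni_upper = [chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]
uni_re = "[" + "".join(uni_upper) + "]"

# Zero-width match in front of every uppercase character, except at the
# very start of the string. Splitting on empty matches needs Python 3.7+.
split_point = re.compile("(?<!^)(?=" + uni_re + ")")


def split_cased(identifier):
    """Split an identifier casedLikeThis into its component words."""
    return split_point.split(identifier)


def render_module(pattern):
    """Return Python source text that defines uni_re without recomputing it."""
    # ascii() escapes every non-ASCII character, so the generated text is
    # plain ASCII and does not depend on the source file encoding.
    header = "# Generated file, do not edit by hand.\n"
    return header + "uni_re = " + ascii(pattern) + "\n"


print(split_cased("casedLikeThis"))
print(split_cased("überGrößeÄnderung"))
```

Writing the result of render_module(uni_re) to a file (say
uni_upper_class.py, a hypothetical name) and importing uni_re from it
would replace the scan over all code points with an ordinary module
import.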

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list
