Re: [regex] case-splitting strings in unicode

Micah Elliott Sat, 08 Oct 2005 17:56:47 -0700

On Oct 09, John Perks and Sarah Mount wrote:
> I have to split some identifiers that are casedLikeThis into their
> component words. In this instance I can safely use [A-Z] to represent
> uppercase, but what pattern should I use if I wanted it to work more
> generally? I can envisage walking the string testing the
> unicodedata.category of each char, but is there a regex'y way to
> denote "uppercase"?


Not sure what your output should look like but something like this could
work:

>>> import re
>>> re.sub(r'([A-Z])', r' \1', 'theFirstTest theSecondTest')
'the First Test the Second Test'

This can be adapted for multiline, etc, but maybe '[A-Z]' is
sufficiently general.  The regex module does have an understanding of
unicode (but I don't, sorry); you could add (?u) make it unicode aware.
For programming language identifiers I wouldn't think that unicode
should be an issue.  Sorry I'm no help with unicode specifics.

Some useful links:

http://www.python.org/doc/2.4.2/lib/module-re.html
http://www.amk.ca/python/howto/regex/regex.html

-- 
Micah Elliott
<[EMAIL PROTECTED]>
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: [regex] case-splitting strings in unicode

Reply via email to