On Oct 09, John Perks and Sarah Mount wrote: > I have to split some identifiers that are casedLikeThis into their > component words. In this instance I can safely use [A-Z] to represent > uppercase, but what pattern should I use if I wanted it to work more > generally? I can envisage walking the string testing the > unicodedata.category of each char, but is there a regex'y way to > denote "uppercase"?
Not sure what your output should look like but something like this could work: >>> import re >>> re.sub(r'([A-Z])', r' \1', 'theFirstTest theSecondTest') 'the First Test the Second Test' This can be adapted for multiline, etc, but maybe '[A-Z]' is sufficiently general. The regex module does have an understanding of unicode (but I don't, sorry); you could add (?u) make it unicode aware. For programming language identifiers I wouldn't think that unicode should be an issue. Sorry I'm no help with unicode specifics. Some useful links: http://www.python.org/doc/2.4.2/lib/module-re.html http://www.amk.ca/python/howto/regex/regex.html -- Micah Elliott <[EMAIL PROTECTED]> -- http://mail.python.org/mailman/listinfo/python-list