> I'm trying to make a unicode friendly regexp to grab sentences
> reasonably reliably for as many unicode languages as
> possible, focusing on european languages first, hence it'd be
> useful to be able to refer to any uppercase unicode character
> instead of just the typical [A-Z], which doesn't include, for
> example É. Is there a way to do this, or do I have to stick
> with using the isupper method of the string class?

Well, assuming you pass in the UNICODE or LOCALE specifier, the
following portion of a regexp *should* find what you're describing:

###############################################
import re

tests = [
    ("1", False),
    ("a", True),
    ("Hello", True),
    ("2bad", False),
    ("bad1", False),
    ("a c", False),
    ]

# Any number of "word" characters that are neither digits nor underscores.
r = re.compile(r'^(?:(?=\w)[^\d_])*$')

for test, expected_result in tests:
    # "passed" is True when the match result agrees with the expectation.
    if r.match(test):
        passed = expected_result
    else:
        passed = not expected_result
    print("[%s] expected [%s] passed [%s]" % (
        test, expected_result, passed))
###############################################

That looks for a "word" character ("\w") but doesn't swallow it
("(?=...)"), and then consumes that character only if it is not
("[^...]") a digit ("\d") or an underscore. It looks for any number
of "these things" ("(?:...)*"), which you can tweak to your own
taste. Note that this matches letters of either case; the re module
has no uppercase-only class, so to insist on uppercase you'd still
combine it with isupper().

For Unicode-ification, just pass the re.UNICODE parameter to
compile().

Hope this makes sense and helps,

-tkc
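
P.S. A minimal sketch of the Unicode-ified version, in case it helps. It uses the same pattern compiled with re.UNICODE, with "*" swapped for "+" so that an empty string doesn't match. The helper name is_capitalized_word is made up for illustration, and checking the first character with isupper() is only one possible way to pick out sentence-initial words, not something the regexp does for you.

```python
import re

# Same idea as the pattern above, compiled with re.UNICODE so that
# \w and \d follow the Unicode character database.
# "+" instead of "*" so that an empty string does not match.
LETTERS = re.compile(r'^(?:(?=\w)[^\d_])+$', re.UNICODE)


def is_capitalized_word(word):
    """Hypothetical helper: True if word consists only of characters the
    pattern accepts and its first character is uppercase."""
    return bool(LETTERS.match(word)) and word[0].isupper()


if __name__ == "__main__":
    # \u00c9 is "É", \u00e9 is "é".
    for word in [u"\u00c9cole", u"\u00e9cole", u"Hello", u"2bad", u"a c"]:
        print("[%s] capitalized word? [%s]" % (word, is_capitalized_word(word)))
```

The regexp handles the "is this a letter" part and isupper() handles the case check, since [A-Z] would miss characters like É.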