On Sun, 2007-07-22 at 22:33 +0200, Peter Kleiweg wrote:
> >>> import re
> >>> s = u'a b\u00A0c d'
> >>> s.split()
> [u'a', u'b', u'c', u'd']
> >>> re.findall(r'\S+', s)
> [u'a', u'b\xa0c', u'd']
>

If you want the Unicode interpretation of \S+, etc., pass the
re.UNICODE flag:

>>> re.findall(r'\S+', s, re.UNICODE)
[u'a', u'b', u'c', u'd']

See http://docs.python.org/lib/node46.html

> This isn't documented either:
>
> >>> s = ' b c '
> >>> s.split()
> ['b', 'c']
> >>> s.split(' ')
> ['', 'b', 'c', '']

I believe the following documents it accurately:

http://docs.python.org/lib/string-methods.html

    If sep is not specified or is None, a different splitting
    algorithm is applied. First, whitespace characters (spaces,
    tabs, newlines, returns, and formfeeds) are stripped from both
    ends. Then, words are separated by arbitrary length strings of
    whitespace characters. Consecutive whitespace delimiters are
    treated as a single delimiter ("'1 2 3'.split()" returns
    "['1', '2', '3']"). Splitting an empty string or a string
    consisting of just whitespace returns an empty list.
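
For anyone reading this on Python 3, here is a minimal sketch of the same two points. It assumes Python 3, where str patterns are Unicode-aware by default, so the situation is reversed: you get the Unicode meaning of \S without any flag, and re.ASCII brings back the ASCII-only behaviour shown in the original question. The variable names are mine, for illustration only.

```python
import re

s = "a b\u00A0c d"  # U+00A0 is NO-BREAK SPACE

# str.split() with no separator splits on Unicode whitespace,
# so the no-break space acts as a delimiter.
split_default = s.split()

# Python 3 str patterns are Unicode-aware by default, so \S
# treats U+00A0 as whitespace without any flag.
unicode_tokens = re.findall(r"\S+", s)

# re.ASCII restricts \s and \S to ASCII whitespace, which
# reproduces the result from the original question.
ascii_tokens = re.findall(r"\S+", s, re.ASCII)

# split() versus split(' '): with an explicit separator, leading and
# trailing delimiters produce empty strings instead of being stripped.
padded = " b c "
no_sep = padded.split()
space_sep = padded.split(" ")

print(split_default)
print(unicode_tokens)
print(ascii_tokens)
print(no_sep)
print(space_sep)
```

The last two results show the difference described in the documentation quoted above: only the no-argument form strips the ends and collapses runs of whitespace.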