Bugs item #1390608, was opened at 2005-12-26 16:03
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1390608&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Python Library
Group: Python 2.4
Status: Closed
Resolution: Wont Fix
Priority: 5
Submitted By: MvR (maxim_razin)
Assigned to: M.-A. Lemburg (lemburg)
Summary: split() breaks no-break spaces

Initial Comment:
string.split(), str.split() and unicode.split() without
parameters split strings at the no-break space (U+00A0)
character. This character is specifically intended not to be a
split boundary.

>>> u"Hello\u00A0world".split()
[u'Hello', u'world']

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2006-01-03 12:07

Message:
Logged In: YES
user_id=38388

No. These things are application scope details and should
thus be implemented in the application rather than as a method
on an object.

The methods always work on whitespace and that's clearly
defined.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2006-01-03 11:33

Message:
Logged In: YES
user_id=89016

Seems I confused strip() with split(). I *did* try that
workaround, and it did what I expected: It *didn't* split on
U+00A0 ;)

If we want to fix this discrepancy, we could add methods
stripchars() (as a synonym for strip()) and stripstring(),
as well as splitchars() and splitstring() (as a synonym for
split()).

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-01-02 12:13

Message:
Logged In: YES
user_id=38388

Oops. You're right, Sjoerd.

Still, you could achieve the splitting by using a regular
expression that is built from the set of characters fetched
from the Unicode database and then using the .split() method
of the re object.
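
The regular-expression approach described in the comment above might look roughly like the following. This is only a sketch under stated assumptions: it is written for Python 3 rather than the Python 2.4 discussed in the thread, it takes "whitespace" to mean whatever str.isspace() reports (which is intended to mirror the default behaviour of split()), and the names breaking_whitespace_pattern and split_keeping_nbsp are hypothetical helpers, not part of any library.

```python
import re
import sys
import unicodedata

NBSP = "\u00a0"  # NO-BREAK SPACE


def breaking_whitespace_pattern(keep_together=(NBSP,)):
    """Compile a regex matching runs of whitespace, minus excluded characters.

    Hypothetical helper: scans every code point and keeps those that
    str.isspace() treats as whitespace, except the ones in keep_together.
    """
    chars = [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if chr(cp).isspace() and chr(cp) not in keep_together
    ]
    # A character class matches any one of the characters, which is what
    # passing them as a single separator string to split() does not do.
    return re.compile("[" + re.escape("".join(chars)) + "]+")


# Built once, because scanning the whole code point range is slow.
_BREAKING_WS = breaking_whitespace_pattern()


def split_keeping_nbsp(text):
    """Split on whitespace runs, but never at U+00A0 (hypothetical helper)."""
    # Drop the empty strings produced by leading or trailing whitespace.
    return [token for token in _BREAKING_WS.split(text) if token]


if __name__ == "__main__":
    sample = "Hello\u00a0world  foo\tbar"
    print(split_keeping_nbsp(sample))
    print(unicodedata.name(NBSP))
```

With this sketch, the sample string is split at the ordinary spaces and the tab, while "Hello\u00a0world" stays in one piece. Other characters that should not act as split points could be passed via keep_together.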

----------------------------------------------------------------------

Comment By: Sjoerd Mullender (sjoerd)
Date: 2006-01-02 11:48

Message:
Logged In: YES
user_id=43607

Walter and MAL, did you actually try that workaround?  It
doesn't work:

>>> import sys, unicodedata
>>> spaces = u"".join(unichr(c) for c in xrange(0, sys.maxunicode) if unicodedata.category(unichr(c))=="Zs" and c != 160)
>>> foo = u"Hello\u00A0world"
>>> foo.split(spaces)
[u'Hello\xa0world']

That's because split() takes the whole separator argument as
the separator, not any one of the characters in it.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-12-30 14:06

Message:
Logged In: YES
user_id=38388

Maxim, you are right that \xA0 is a non-break space.

However, as the others have already mentioned, the .split()
method defaults to breaking a string on whitespace
characters, not breakable whitespace characters. The intent
is not a typographical one, but originates from the desire
to quickly tokenize a string.

If you'd rather see a different set of whitespace
characters used, you can pass such a template string to the
.split() method (Walter gave an example).

Closing this as "Won't fix".

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-12-30 13:35

Message:
Logged In: YES
user_id=89016

What's wrong with the following?

import sys, unicodedata

spaces = u"".join(unichr(c) for c in xrange(0, sys.maxunicode) if unicodedata.category(unichr(c))=="Zs" and c != 160)

foo.split(spaces)

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2005-12-30 01:30

Message:
Logged In: YES
user_id=55188

The Python documentation says that it splits on "whitespace
characters", not "breaking characters". So the current
behavior is correct according to the documentation.
Moreover, the rationale behind the string methods depends
heavily on the ctype functions in libc. Therefore, we can't
give the NBSP special treatment.

However, I feel the need for a splitting function that is
aware of which characters are breaking and which are not. How
about adding it as unicodedata.split()?

----------------------------------------------------------------------

Comment By: Fredrik Lundh (effbot)
Date: 2005-12-29 21:42

Message:
Logged In: YES
user_id=38376

split isn't a word-wrapping split, so I'm not sure that's the
right place to fix this.  ("no-break space" is whitespace,
according to the Unicode standard, and split breaks on
whitespace).

----------------------------------------------------------------------