[ python-Bugs-1390608 ] split() breaks no-break spaces
Bugs item #1390608, was opened at 2005-12-26 16:03
Message generated for change (Comment added) made by doerwalter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1390608&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: MvR (maxim_razin)
Assigned to: Nobody/Anonymous (nobody)
Summary: split() breaks no-break spaces

Initial Comment:
string.split(), str.split() and unicode.split() without parameters
split strings at the no-break space (U+00A0) character. This character
is specifically intended not to be a split boundary.

>>> u"Hello\u00A0world".split()
[u'Hello', u'world']

--

>Comment By: Walter Dörwald (doerwalter)
Date: 2005-12-30 13:35

Message:
Logged In: YES
user_id=89016

What's wrong with the following?

import sys, unicodedata
spaces = u"".join(unichr(c) for c in xrange(0, sys.maxunicode) if unicodedata.category(unichr(c))=="Zs" and c != 160)
foo.split(spaces)

--

Comment By: Hye-Shik Chang (perky)
Date: 2005-12-30 01:30

Message:
Logged In: YES
user_id=55188

The Python documentation says that it splits on "whitespace
characters", not "breaking characters". So the current behavior is
correct according to the documentation. Moreover, the string methods
depend heavily on the ctype functions in libc. Therefore, we can't
give the NBSP special treatment.

However, I do see the need for a splitting function that is aware of
which characters are breaking and which are not. How about adding it
as unicodedata.split()?

--

Comment By: Fredrik Lundh (effbot)
Date: 2005-12-29 21:42

Message:
Logged In: YES
user_id=38376

split isn't a word-wrapping split, so I'm not sure that's
the right place to fix this. ("no-break space" is whitespace,
according to the Unicode standard, and split breaks on whitespace).
--

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1390608&group_id=5470

___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[ python-Bugs-1390608 ] split() breaks no-break spaces
Bugs item #1390608, was opened at 2005-12-26 16:03
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1390608&group_id=5470

Category: Python Library
Group: Python 2.4
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: MvR (maxim_razin)
>Assigned to: M.-A. Lemburg (lemburg)
Summary: split() breaks no-break spaces

--

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-12-30 14:06

Message:
Logged In: YES
user_id=38388

Maxim, you are right that \xA0 is a non-break space.

However, like the others already mentioned, the .split()
method defaults to breaking a string on whitespace
characters, not breakable whitespace characters. The intent
is not a typographical one, but originates from the desire
to quickly tokenize a string.

If you'd rather see a different set of whitespace
characters used, you can pass such a template string to the
.split() method (Walter gave an example).

Closing this as "Won't fix".
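A minimal Python 3 sketch of the workaround discussed in this thread: splitting on every whitespace character except U+00A0. One caveat worth noting is that str.split(sep) treats a multi-character sep as a single separator string rather than as a set of characters, so this sketch uses a regular expression character class instead. The helper name split_keep_nbsp is hypothetical and not part of any library, and only U+00A0 is excluded; other non-breaking characters would have to be added to the class by hand.

```python
import re

# For str patterns, \s matches Unicode whitespace (including U+00A0).
# [^\S\u00A0] therefore means "whitespace, but not NO-BREAK SPACE".
_BREAKING_WS = re.compile(r"[^\S\u00A0]+")


def split_keep_nbsp(text):
    """Split on runs of whitespace, never on U+00A0 (hypothetical helper)."""
    # Drop empty pieces so leading/trailing whitespace behaves like str.split().
    return [piece for piece in _BREAKING_WS.split(text) if piece]


default_result = "Hello\u00A0world again".split()
nbsp_result = split_keep_nbsp("Hello\u00A0world again")

print(default_result)  # ['Hello', 'world', 'again']
print(nbsp_result)     # ['Hello\xa0world', 'again']
```

The default split() is left untouched here, in line with the "Won't fix" resolution; the tokenizing behaviour stays the default and the typographic behaviour is opt-in.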
[ python-Bugs-1349732 ] urllib.urlencode provides two features in one param
Bugs item #1349732, was opened at 2005-11-06 23:58
Message generated for change (Comment added) made by salty-horse
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1349732&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Ori Avtalion (salty-horse)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib.urlencode provides two features in one param

Initial Comment:
Using the 2.4 distribution.

It seems that urlencode knows how to handle unicode input
with quote_plus and ascii encoding, but it only does that
when doseq is True.

1) There's no mention of that useful feature in the
documentation.
2) If I want to encode unicode data without doseq's feature,
there's no way to do so. Although it's rare to use doseq's
intended function, the two shouldn't be connected.

Shouldn't values be checked with _is_unicode and handled
correctly in both modes of doseq?

One reason I see that *might* make the unicode check cause
problems is that the comment says "preserve old behavior" when
doseq is False. Could such a check affect the behaviour of
old code?
If it can, the unicode handling could be another optional
parameter.

Also, the docstring is really unclear as to the purpose of
doseq. Can a small example be added? (I saw no PEP
guidelines for how examples should be given in docstrings,
or if they're even allowed, so perhaps this fits just the
regular documentation.)

With query={"key": ("val1", "val2")}
doseq=1 yields: key=val1&key=val2
doseq=0 yields: key=%28%27val1%27%2C+%27val2%27%29

After the correct solution is settled, I'll gladly submit a
patch with the fixes.
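The doseq behaviour described in the report can be illustrated with a short sketch. It assumes Python 3, where the function lives at urllib.parse.urlencode rather than urllib.urlencode as in the 2.4 distribution under discussion; the query is the one given in the report.

```python
from urllib.parse import urlencode

query = {"key": ("val1", "val2")}

# doseq=True: each element of the sequence becomes its own key=value pair.
expanded = urlencode(query, doseq=True)

# doseq=False (the default): the whole value goes through str() and is
# then quoted as a single string, parentheses and quotes included.
flattened = urlencode(query, doseq=False)

print(expanded)   # key=val1&key=val2
print(flattened)  # key=%28%27val1%27%2C+%27val2%27%29
```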
--

>Comment By: Ori Avtalion (salty-horse)
Date: 2005-12-30 18:10

Message:
Logged In: YES
user_id=854801

> However, I was unable to reproduce your observation that
> doseq=0 results in urlencode not knowing how to handle
> unicode.

I had given urlencode a Hebrew unicode string, and
"".encode() could not convert it to ASCII:

s_unicode = u'\u05d1\u05d3\u05d9\u05e7\u05d4'
print urllib.urlencode({"key":s_unicode}, 0)

As I notice now, the line:
>> urllib.urlencode({"key":s_unicode}, 1)
key=%3F%3F%3F%3F%3F
does not raise an exception but produces an incorrect result.

The correct way to call it is like this:
>> urllib.urlencode({"key":s_unicode.encode("iso8859_8")}, 1)
key=%E1%E3%E9%F7%E4

So, in addition to your suggestion, I think the
documentation should explicitly state that unicode strings
will be treated as us-ascii.

What about my suggestion of an example for doseq's behaviour
in the docstring?

--

Comment By: Mike Brown (mike_j_brown)
Date: 2005-12-30 01:32

Message:
Logged In: YES
user_id=371366

I understand why the implementation is the way it is. I
agree that it is not documented as well as it could be.

I also agree with your implication that ASCII-range unicode
input should be acceptable (and converted to ASCII bytes
internally before percent-encoding), regardless of doseq. I
would not go so far as to say non-ASCII-range unicode should
be accepted, since safe conversion to bytes before
percent-encoding would not be possible.

However, I was unable to reproduce your observation that
doseq=0 results in urlencode not knowing how to handle
unicode. The object is just passed to str(). Granted, that's
not *quite* the same as when doseq=1, where unicode objects
are specifically run through .encode('us-ascii','replace'),
but I wouldn't characterize it as not knowing how to handle
ASCII-range unicode. The results for ASCII-range unicode are
the same.

If you're going to make things more consistent, I would
actually tighten up the doseq=1 behavior, replacing

v = quote_plus(v.encode("ASCII","replace"))

with

v = quote_plus(v.encode("ASCII","strict"))

and then mention in the docs that any object type is
acceptable as a key or value, but if unicode is passed, it
must be all ASCII-range characters; if there is a risk of
characters above \u007f being passed, then the caller should
convert the unicode to str beforehand.

As for doseq's purpose and documentation, the doseq=1
behavior is ideal for almost all situations, since it takes
care not to treat str or unicode as a sequence of separate
1-character values. AFAIK, the only reason it isn't the
default is backward compatibility. It was introduced in
Python 2.0.1 and was trying to retain compatibility with
code written for Python 1.5.2 through 2.0.0. I suggest
deprecating it and making doseq=1 behavior the default, if
others (MvL?) approve.
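The difference between the "replace" and "strict" error handling discussed in these comments can be sketched as follows. This assumes Python 3's urllib.parse.urlencode, whose encoding and errors parameters did not exist in the Python 2.4 function under discussion, so it illustrates the idea rather than reproducing the 2.4 code path.

```python
from urllib.parse import urlencode

s_unicode = "\u05d1\u05d3\u05d9\u05e7\u05d4"  # the Hebrew string from the report

# "replace" silently turns every non-ASCII character into "?" (%3F).
lossy = urlencode({"key": s_unicode}, encoding="ascii", errors="replace")

# "strict" raises instead of quietly producing a wrong result.
try:
    urlencode({"key": s_unicode}, encoding="ascii", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Naming the intended encoding explicitly yields the bytes the caller expects.
explicit = urlencode({"key": s_unicode}, encoding="iso8859_8")

print(lossy)          # key=%3F%3F%3F%3F%3F
print(strict_failed)  # True
print(explicit)       # key=%E1%E3%E9%F7%E4
```

With "strict", a bad input fails loudly at the call site, which is the argument made above for tightening the doseq=1 behaviour.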
[ python-Bugs-1394135 ] Deleting first item causes anydbm.first() to fail
Bugs item #1394135, was opened at 2005-12-30 20:24
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1394135&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Dan Bisalputra (danbiz)
Assigned to: Nobody/Anonymous (nobody)
Summary: Deleting first item causes anydbm.first() to fail

Initial Comment:
If the first item in a database is deleted, the first call
to anydbm.first() after the deletion causes a
DBNotFoundError exception to be raised. The attached program
causes the error on my system.

I am currently working around the problem by calling first()
after each deletion, enclosed by a try block.

I am using Python 2.4.2 running under Windows ME.