Bugs item #1251300, was opened at 2005-08-03 21:49
Message generated for change (Comment added) made by nhaldimann
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1251300&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Submitted By: Nik Haldimann (nhaldimann)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Decoding with unicode_internal segfaults on UCS-4 builds

Initial Comment:
On UCS-4 builds, decoding a byte string with the unicode_internal codec
doesn't work correctly for code points from 0x80000000 upwards and even
segfaults. I have observed the same behaviour on 2.5 from CVS and 2.4.0
on OS X/PowerPC as well as on 2.3.5 on Linux/x86. Here's an example:

Python 2.5a0 (#1, Aug 3 2005, 21:34:05)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1671)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\x7f\xff\xff\xff".decode("unicode_internal")
u'\U7fffffff'
>>> "\x80\x00\x00\x00".decode("unicode_internal")
u'\x00'
>>> "\x80\x00\x00\x01".decode("unicode_internal")
u'\x01'
>>> "\x81\x00\x00\x00".decode("unicode_internal")
Segmentation fault

On little endian architectures the byte strings must be reversed for the
same effect.

I'm not sure if I understand what's going on, but I guess there are 2
solution strategies:

1. Make unicode_internal work for any code point up to 0xFFFFFFFF.
2. Make unicode_internal raise a UnicodeDecodeError for anything above
   0x10FFFF (== sys.maxunicode for UCS-4 builds).

It seems like there are no Unicode code points above 0x10FFFF, so the
latter solution feels more correct to me, even though it might break
backwards compatibility a tiny bit.
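
The second strategy above (reject anything above 0x10FFFF) could be sketched roughly as follows in modern Python 3. This is only an illustration of the idea, not the C patch discussed in this thread: the helper name `decode_ucs4_strict`, the codec label `ucs4-sketch`, and the decision to also reject input whose length is not a multiple of 4 are all assumptions made for the example.

```python
import struct

# Highest valid Unicode code point (sys.maxunicode on UCS-4 builds).
MAX_CODE_POINT = 0x10FFFF


def decode_ucs4_strict(data: bytes, byteorder: str = "big") -> str:
    """Hypothetical helper: decode 4-byte units, rejecting values > 0x10FFFF."""
    remainder = len(data) % 4
    if remainder:
        # Assumption: a trailing partial unit is treated as an error.
        start = len(data) - remainder
        raise UnicodeDecodeError(
            "ucs4-sketch", data, start, len(data), "truncated data"
        )

    fmt = ">I" if byteorder == "big" else "<I"  # unsigned 32-bit integer
    chars = []
    for offset in range(0, len(data), 4):
        (code_point,) = struct.unpack_from(fmt, data, offset)
        if code_point > MAX_CODE_POINT:
            # Reading the unit as unsigned means values such as 0x81000000
            # are rejected here instead of being misread as negative.
            raise UnicodeDecodeError(
                "ucs4-sketch", data, offset, offset + 4,
                "illegal code point (> 0x10ffff)",
            )
        chars.append(chr(code_point))
    return "".join(chars)
```

With this sketch, `decode_ucs4_strict(b"\x00\x00\x00\x41")` yields `"A"`, while `b"\x81\x00\x00\x00"` raises a UnicodeDecodeError rather than crashing.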

The unicodeescape codec already does a similar thing:

>>> u"\U00110000"
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character

----------------------------------------------------------------------

>Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-19 16:17

Message:
Logged In: YES
user_id=1317086

I agree about the ifdefs. I'm not sure about how to handle input strings
of incorrect length. I guess raising a UnicodeDecodeError is in order.
But I think it doesn't make sense to let it pass through the error
handler, since the data the handler would see is potentially nonsensical
(e.g., the code point value). Can you comment on this? Is it ok to raise
a UnicodeDecodeError and skip the error handler here?

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-08-18 22:17

Message:
Logged In: YES
user_id=89016

The patch has a problem with input strings of a length that is not a
multiple of 4, e.g. "\x00".decode("unicode-internal") returns u""
instead of raising an error. Also in a UCS-2 build most of the tests are
irrelevant (as it's not possible to create codepoints above 0x10ffff
even when using surrogates), so probably they should be ifdef'd out.

----------------------------------------------------------------------

Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-05 23:08

Message:
Logged In: YES
user_id=1317086

Here's the patch with error handler support + test. Again: Please review
carefully.

----------------------------------------------------------------------

Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-05 18:35

Message:
Logged In: YES
user_id=1317086

Ah, that PEP clears some things up for me. I will look into it, but I
hope you realize this requires tinkering with unicodeobject.c since the
error handler code seems to live there.
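
For readers unfamiliar with the PEP 293 error handlers mentioned in this thread, here is a minimal sketch of the mechanism in modern Python 3. It uses the built-in ascii codec purely for demonstration (not unicode_internal), and the handler name `sketch-show-bytes` and function `show_and_replace` are made up for this example. It shows what data a handler receives: the exception carries the input object and the start/end positions of the offending bytes, and the handler returns a replacement string plus the position at which decoding should resume.

```python
import codecs


def show_and_replace(exc):
    """Hypothetical PEP 293 handler: replace undecodable bytes with <hex>."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only handles decoding errors.
        raise exc
    bad_bytes = exc.object[exc.start:exc.end]
    replacement = "<%s>" % bad_bytes.hex()
    # Return the replacement text and the index where decoding continues.
    return replacement, exc.end


# Register the handler under an arbitrary, made-up name.
codecs.register_error("sketch-show-bytes", show_and_replace)

result = b"ab\xffcd".decode("ascii", errors="sketch-show-bytes")
print(result)
```

The handler only ever sees byte positions and a reason string, which is relevant to the question raised above about whether a handler can do anything sensible with a malformed or truncated 4-byte unit.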

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-08-05 18:03

Message:
Logged In: YES
user_id=89016

Your patch doesn't support PEP 293 error handlers. Could you add support
for that?

----------------------------------------------------------------------

Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-05 16:50

Message:
Logged In: YES
user_id=1317086

OK, I put something together. Please review carefully as I'm not very
familiar with the C API. I have tested this with the CVS HEAD on OS X
and Linux.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-08-04 16:41

Message:
Logged In: YES
user_id=38388

I think solution 2 is the right approach, since UCS-4 only has 0x10FFFF
possible code points. Could you provide a patch?

----------------------------------------------------------------------