Marc-Andre Lemburg <[EMAIL PROTECTED]> added the comment:

Adam, I do know what I'm talking about: I was the lead designer of the Unicode integration you find in Python and implemented most of it.
What you see as repr() of a Unicode object is the result of applying a codec to the internal representation. Please don't confuse the output of the codec ("unicode-escape") with the internal representation.

That said, Ezio did uncover a bug and we need to find the cause. It's likely caused by the fact that the UTF-8 codec does not recombine surrogates on UCS4 builds. See this comment in the codec implementation:

        case 3:
            if ((s[1] & 0xc0) != 0x80 ||
                (s[2] & 0xc0) != 0x80) {
                errmsg = "invalid data";
                startinpos = s-starts;
                endinpos = startinpos+3;
                goto utf8Error;
            }
            ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
            if (ch < 0x0800) {
                /* Note: UTF-8 encodings of surrogates are considered
                   legal UTF-8 sequences;

                   XXX For wide builds (UCS-4) we should probably try
                       to recombine the surrogates into a single code
                       unit.
                */
                errmsg = "illegal encoding";
                startinpos = s-starts;
                endinpos = startinpos+3;
                goto utf8Error;
            }
            else
                *p++ = (Py_UNICODE)ch;
            break;

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3297>
_______________________________________
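For readers following the XXX note in the codec excerpt, here is a minimal sketch of what "recombining the surrogates" could mean. It is written in Python rather than C for brevity, and it is not the CPython implementation: the helper names `decode_three_byte` and `recombine_surrogates` are hypothetical. The first helper mirrors the bit arithmetic of the `case 3` branch; the second merges an adjacent high/low surrogate pair into one supplementary code point, as a wide (UCS-4) build would presumably want.

```python
def decode_three_byte(b0, b1, b2):
    """Decode one 3-byte UTF-8 sequence using the same arithmetic as `case 3`.

    Hypothetical helper; only the continuation-byte check is reproduced here.
    """
    if (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("invalid data")
    return ((b0 & 0x0F) << 12) + ((b1 & 0x3F) << 6) + (b2 & 0x3F)


def recombine_surrogates(code_units):
    """Merge each high/low surrogate pair into a single code point.

    Lone surrogates are passed through unchanged (an assumption of this
    sketch; a real codec might prefer to report an error instead).
    """
    out = []
    i = 0
    n = len(code_units)
    while i < n:
        hi = code_units[i]
        if (0xD800 <= hi <= 0xDBFF and i + 1 < n
                and 0xDC00 <= code_units[i + 1] <= 0xDFFF):
            lo = code_units[i + 1]
            # Standard UTF-16 pair arithmetic: 10 bits from each surrogate.
            out.append(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
            i += 2
        else:
            out.append(hi)
            i += 1
    return out


# The byte sequence ED A0 80 ED B0 80 encodes the two surrogates
# U+D800 and U+DC00 as separate 3-byte sequences.
units = [decode_three_byte(0xED, 0xA0, 0x80),
         decode_three_byte(0xED, 0xB0, 0x80)]
combined = recombine_surrogates(units)
```

In this sketch `units` holds the two surrogate values that the `case 3` branch would store separately, and `combined` holds the single code point U+10000 they stand for.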