On Tue, Oct 18, 2016 at 2:09 AM, Chris Angelico <ros...@gmail.com> wrote:
> That's not a UTF-16 encoded byte string, though. It's a Unicode string
> that contains two surrogates. So maybe the solution is to convert from
> true Unicode strings into strings like the above - but if so, it
> absolutely must not be done in any user-facing way. It should be an
> implementation detail of Tkinter.

Yes, it's an invalid Unicode string, since it contains surrogate codes. At the C level this gets passed as a UTF-16 string, even on Unix: in most cases a Tcl_UniChar is defined as a C unsigned short, since the macro TCL_UTF_MAX defaults to 3 (UTF-8 bytes).

As I said, I'm not experienced enough with Tcl/Tk to know whether UTF-16 strings with surrogate pairs cause other problems. On Linux it prints the surrogate codes as empty box characters, which is certainly ugly, and printing two characters in place of one is also incorrect. It seems that Tcl's UTF-8 conversion doesn't handle UTF-16 surrogate pairs. Thus support for non-BMP characters would be limited to Windows until the default TCL_UTF_MAX is greater than 3 on Unix platforms. Supposedly this has actually worked in the core Tcl implementation for some time, but extensions are holding it back.
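
To make the "Unicode string that contains two surrogates" concrete, here is a minimal Python 3 sketch of the conversion in both directions. The helper names to_surrogate_pair and from_surrogate_pair are made up for this example; they are not part of Tkinter or Tcl, and this is only an illustration of the arithmetic, not how Tkinter actually does it.

```python
def to_surrogate_pair(ch: str) -> str:
    """Split one non-BMP character into a UTF-16 high/low surrogate pair.

    Hypothetical helper for illustration only. BMP characters are
    returned unchanged.
    """
    cp = ord(ch)
    if cp < 0x10000:
        return ch
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return chr(high) + chr(low)


def from_surrogate_pair(s: str) -> str:
    """Recombine surrogate pairs in s into real non-BMP characters.

    The 'surrogatepass' error handler lets the lone surrogates through
    the encoder; decoding the resulting UTF-16 bytes then joins each pair.
    """
    return s.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


pair = to_surrogate_pair("\U0001F600")
# Print the code points rather than the string itself, since writing
# lone surrogates to a UTF-8 terminal raises UnicodeEncodeError.
print([hex(ord(c)) for c in pair])   # ['0xd83d', '0xde00']
print(from_surrogate_pair(pair) == "\U0001F600")   # True
```

The two-element string is what ends up looking like two separate characters to anything that treats each 16-bit unit as a character of its own, which matches the two empty boxes I see on Linux.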