Terry J. Reedy <tjre...@udel.edu> added the comment:

Recap: IDLE 3.x on Windows exits with UnicodeDecodeError when pasting a 
non-BMP (astral) character into an editor, grep, or shell window.  Examples: 
𐒢 '\U000104a2', 𝐇, 🐍 '\U0001F40D', or 
🐱 '\U0001F431' (UTF-8 b'\xf0\x9f\x90\xb1', UTF-16LE b'\x3d\xd8\x31\xdc').  
Display issues are not directly part of this issue.
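
For reference, a minimal sketch that reproduces the byte values quoted above 
for one of the sample characters and splits the UTF-16LE form into its two 
surrogate code units.  The variable names are illustrative only.

```python
# Encodings of one astral (non-BMP) character, U+1F431.
cat = '\U0001F431'

utf8 = cat.encode('utf-8')        # 4 bytes: f0 9f 90 b1
utf16le = cat.encode('utf-16le')  # 4 bytes: 3d d8 31 dc
print(utf8)
print(utf16le)

# Split the UTF-16LE bytes into the two 16-bit surrogate code units.
high = int.from_bytes(utf16le[0:2], 'little')
low = int.from_bytes(utf16le[2:4], 'little')
print(hex(high), hex(low))  # 0xd83d 0xdc31
```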

The exact error message has varied with the python version, but all likely 
result from the same error.

3.2 msg145581: traceback PyShell.main(), root.mainloop(), tk.mainloop().

  'utf8' codec can't decode bytes in position 1-2: invalid continuation byte

3.3 msg177750: traceback starts with two calls in new runpy module.
  'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte

3.6 to now: same traceback
  'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The initial byte is 0xed regardless of which astral char above is pasted.  Tal, 
if the problem were utf-8 decoding utf-16le bytes, the initial byte in the 
error message for astral chars would (usually) be 0xd8, and there would be 
problems with BMP chars also.
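
A small sketch of that hypothetical case, decoding utf-16le bytes as utf-8 in 
plain Python (no tkinter involved).  The helper name utf8_error is made up for 
this illustration.

```python
def utf8_error(data):
    """Return the UnicodeDecodeError raised by utf-8-decoding data, or None."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        return err
    return None

# Astral char: the offending byte is 0xd8 (second byte of the high surrogate).
astral = utf8_error('\U0001F431'.encode('utf-16le'))
print(hex(astral.object[astral.start]), astral.start)  # 0xd8 1

# A non-ASCII BMP char fails too, which is not what is observed in IDLE.
bmp = utf8_error('\u00e9'.encode('utf-16le'))
print(bmp is not None)  # True
```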

In msg145584, I speculated that the problem might be trying to decode a now 
illegal utf-8 encoding of a surrogate character.  In msg145605, Ezio said that 
the first surrogate would be '\ud801' and showed that the 2.7 utf-8 'encoding' 
of that is b'\xed\xa0\x81' and that trying to decode that gives the 3.2 error 
above, but with '0-1' instead of '1-2'.  (0xed is the utf-8 start byte for 
chars in U+D000 to U+DFFF, and continuation bytes that map to the surrogate 
blocks, and some others, are now invalid.)  Today, 
b'\xed\xa0\x81'.decode('utf-8') gives exactly the current message above.
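
This can be checked in a current Python 3 without 2.7 at hand.  The sketch 
below assumes the 'surrogatepass' error handler as a stand-in for the lenient 
2.7 encoder.

```python
# First (high) surrogate of U+104A2, as stated above.
high = '\ud801'

# 'surrogatepass' lets the encoder emit the 3-byte form 2.7 produced.
encoded = high.encode('utf-8', 'surrogatepass')
print(encoded)  # b'\xed\xa0\x81'

# Strict decoding of those bytes is rejected at the 0xed start byte.
try:
    encoded.decode('utf-8')
except UnicodeDecodeError as err:
    start = err.start
    message = str(err)
    print(message)
```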

In msg254165, I noted that pasting copied astral chars into a plain Text widget 
works in the sense that there is no error.  (For me, 𐒢 is replaced by two 
replacement chars and the others are shown without colors, but this depends on 
OS and font.) I just verified the same for Entry widgets in IDLE dialogs and 
the Font settings sample text.  As Serhiy said in msg254165, Left x 2 is needed 
to move back past the char and Backspace x 2 to delete it.  (For me, only 1 
Right is needed to move forward past the char.)  But Serhiy also showed that 
once an astral char *is* displayed, it cannot be properly retrieved.

So the question is, if Windows puts utf-16le surrogates on the clipboard, and 
they can be pasted and (sometimes) displayed in a Text widget, why is 
something trying to utf-8 decode the utf-8 encoding of each surrogate when 
pasting into IDLE's augmented text?

In msg207381, Serhiy claimed "The root of issue is in converting strings when 
passed to Python-implemented callbacks. When a text is pasted in IDLE window, 
the callback is called (for highlighting). ...".  He goes on to explain that 
tcl *does* encode surrogates to modified utf-8 before passing them to 
callbacks and claimed that tkinter_pythoncmd_args_2.patch should fix this.
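
To make the byte-level situation concrete, here is a rough pure-Python sketch 
of what such surrogate-by-surrogate bytes look like for an astral char and one 
way they could be turned back into the original character.  This is not what 
tkinter_pythoncmd_args_2.patch does; the function names are hypothetical and 
the approach is only an assumption about how a repair might work.

```python
def to_surrogates(char):
    """Split one astral char into its utf-16 surrogate pair (as a str)."""
    cp = ord(char) - 0x10000
    return chr(0xD800 + (cp >> 10)) + chr(0xDC00 + (cp & 0x3FF))

def surrogate_utf8(char):
    """Encode each surrogate separately: 2 x 3 bytes, each starting with 0xed."""
    return to_surrogates(char).encode('utf-8', 'surrogatepass')

def repair(data):
    """Decode surrogate-wise utf-8 bytes and rejoin pairs into astral chars."""
    text = data.decode('utf-8', 'surrogatepass')  # str holding lone surrogates
    return text.encode('utf-16le', 'surrogatepass').decode('utf-16le')

raw = surrogate_utf8('\U0001F431')
print(raw)                            # b'\xed\xa0\xbd\xed\xb0\xb1'
print(repair(raw) == '\U0001F431')    # True
```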

Disabling Colorizer is not enough to allow astral pasting.  Whatever Serhiy's 
patch did 5 years ago, my copy does not work now.  See PR 16365.

Tal, we augment the x11 paste callback in pyshell.fix_x11_paste.  There is no 
unittest for it, and we would have to avoid breaking it with further changes.

I have thought about replacing the paste callback with clipboard_get, but 
worried that we might not be able to replicate what the system-specific 
tcl/tk/C code does.  That sometimes includes displaying the actual astral 
character. I presume that tcl just passes the clipboard bytes to the graphics 
system, which we cannot do from python.

Anyway, you have shown that clipboard_get does not currently work as we might 
want.  From what Serhiy has said, char *s points to invalid utf-8 bytes.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue13153>
_______________________________________