Eryk Sun added the comment:

>> so ANSI is the natural default for a detached process
>
> To clarify - ANSI is the natural default *for programs that
> don't support Unicode*.

By natural, I meant in the context of using GetConsoleOutputCP(), since WideCharToMultiByte(0, ...) encodes text as ANSI. Clearly UTF-16LE is preferred for IPC on Windows. It's the native Unicode format down to the lowest levels of the kernel. But we're talking about old-school IPC using standard I/O pipelines, for which I think UTF-8 is a better fit.

> Forcing the use of UTF-8 as the code page is the easiest way
> for us to support it.

The console's behavior for codepage 65001 is too buggy. The show stopper is that it limits input to ASCII. The console allocates a temporary buffer for the encoded text that's sized assuming 1 ANSI/OEM byte per UTF-16 code. So if you enter non-ASCII characters, WideCharToMultiByte fails in conhost.exe. But the console reports that the operation succeeded and read 0 bytes. Python's REPL and input() see this as EOF. For example:

    import sys, ctypes, msvcrt

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    conin = open(r'\\.\CONIN$', 'r+')
    h = msvcrt.get_osfhandle(conin.fileno())
    buf = (ctypes.c_char * 15)()
    n = (ctypes.c_ulong * 1)()

    >>> sys.stdin.encoding
    'cp65001'

ReadFile test in Windows 10:

    >>> kernel32.ReadFile(h, buf, 15, n, None)
    Test!
    1
    >>> n[0], buf[:]
    (7, b'Test!\r\n\x00\x00\x00\x00\x00\x00\x00\x00')
    >>> kernel32.ReadFile(h, buf, 15, n, None)
    ¡Prueba!
    1
    >>> n[0], buf[:]
    (0, b'Test!\r\n\x00\x00\x00\x00\x00\x00\x00\x00')

The second call obviously fails, even though it returns 1. The input contains non-ASCII "¡", which in UTF-8 requires 2 bytes, b'\xc2\xa1'. This causes the failure in conhost.exe that I described above.

ReadConsoleA has the same problem:

    >>> kernel32.ReadConsoleA(h, buf, 15, n, None)
    Hello World!
    1
    >>> n[0], buf[:]
    (14, b'Hello World!\r\n\x00')
    >>> kernel32.ReadConsoleA(h, buf, 15, n, None)
    ¡Hola Mundo!
    1
    >>> n[0], buf[:]
    (0, b'Hello World!\r\n\x00')

UTF-8 output is also buggy prior to Windows 8. The problem is that WriteFile returns the number of UTF-16 codes written instead of the number of bytes.
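
As a rough illustration of the sizes involved in both bugs, here is a small sketch in plain Python. It makes no console calls and only models the arithmetic: the helper names are hypothetical and not part of any Windows or CPython API.

```python
# Illustrative only: models the size arithmetic, not conhost.exe itself.
# The helper names below are hypothetical, made up for this sketch.

def utf16_code_units(text):
    """Count the UTF-16 code units (2 bytes each) needed to encode text."""
    return len(text.encode('utf-16-le')) // 2

def utf8_bytes(text):
    """Count the bytes needed to encode text as UTF-8."""
    return len(text.encode('utf-8'))

def fits_one_byte_per_code(text):
    """True if the UTF-8 form fits a buffer sized at 1 byte per UTF-16 code."""
    return utf8_bytes(text) <= utf16_code_units(text)

def apparent_unwritten_bytes(text):
    """Bytes a caller would believe are still unwritten if a write of the
    UTF-8 form reported the UTF-16 code count instead of the byte count."""
    return utf8_bytes(text) - utf16_code_units(text)

# The two lines typed in the ReadFile test, including the trailing CR LF.
for line in ('Test!\r\n', '\u00a1Prueba!\r\n'):
    print(repr(line), utf16_code_units(line), utf8_bytes(line),
          fits_one_byte_per_code(line), apparent_unwritten_bytes(line))
```

The ASCII line needs 7 bytes for 7 codes, so it fits. The second line needs 11 bytes for 10 codes, so a buffer sized at one byte per code is one byte short, and a write that reported 10 for those 11 bytes would appear to leave 1 byte pending.
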
For non-ASCII characters in the BMP, 1 UTF-16 code is 2 or 3 UTF-8 bytes. So it looks like a partial write. A buffered writer will loop multiple times to write what appears to be the remaining bytes, leaving a trail of junk lines in proportion to the number of non-ASCII characters written. Python could work around this by decoding the buffer to get the corresponding number of UTF-16 codes written in the console, but child processes may also be subject to this bug. The only general solution on Windows 7 is to use something like ANSICON, which uses DLL injection to hook and wrap WriteFile and WriteConsoleA.

There's also a UTF-8 related bug in ulib.dll. This bug affects programs that do console codepage conversions, such as more.com. This in turn affects Python's interactive help(). I looked at this in issue 19914. The ulib bug is fixed in Windows 10. I don't know whether it's fixed in Windows 8, but it's there in Windows 7 (supported until 2020).

> This would make Python's implementation much more
> complicated, as well as breaking some scripts and
> existing packages.

Unless you're talking about major breakage, I think switching to the wide-character API is worth it, as the only viable path to supporting Unicode in the console.

The implementation probably should transcode between UTF-16LE and UTF-8, so pure Python never sees UTF-16 byte strings. sys.std*.encoding would be 'utf-8'. os.read and os.write would be implemented as _Py_read and _Py_write (which already exist). For console handles these could delegate to _Py_console_read and _Py_console_write, to convert between UTF-8 and UTF-16LE and call ReadConsoleW and WriteConsoleW.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27179>
_______________________________________