On Sat, Dec 5, 2015 at 4:03 PM, Terry Reedy <tjre...@udel.edu> wrote:
> On 12/5/2015 2:44 PM, Random832 wrote:
>> As someone else pointed out, I meant that as a list of codepages
>> which support all Unicode codepoints, not a list of codepoints
>> not supported by Tk's UCS-2. Sorry, I assumed everyone knew
>> offhand that 65001 was UTF-8
>
> So Microsoft claims, but it is not terribly useful.

Using codepage 65001 is how one encodes/decodes UTF-8 using the
Windows API, i.e. WideCharToMultiByte and MultiByteToWideChar.

If you're just referring to the console, then I agree for the most
part. The console, even in Windows 10, still has two major flaws when
using UTF-8.

The biggest problem is that non-ASCII input gets read as EOF (i.e. 0
bytes read) because of a bug in how conhost.exe (the process that
hosts the console) converts its internal input buffer. Instead of
dynamically determining how many characters to encode based on the
current codepage, it assumes an N byte user buffer is a request for N
characters, which obviously fails with non-ASCII UTF-8. What's worse
is that it doesn't fail the call. It returns to the client that it
successfully read 0 bytes. This causes Python's REPL to quit and
input() to raise EOFError.

The second problem that still exists in Windows 10 is that the console
doesn't save state across writes, so a 2-4 byte UTF-8 code sequence
that gets split into 2 writes due to buffering gets displayed in the
console as 2-4 replacement characters (i.e. U+FFFD). Most POSIX
terminals don't suffer from this problem because they natively use
8-bit strings, whereas Windows transcodes to UTF-16.

Prior to Windows 8, there's another annoying bug. WriteFile and
WriteConsoleA return the number of wchar_t elements written instead of
the number of bytes written. So a buffered writer will write
successively smaller slices of the output buffer until the two numbers
agree. You end up with a (potentially long) trail of garbage at the
end of every write that contains non-ASCII characters.

Since Windows doesn't allow UTF-8 as the system codepage (i.e. the
[A]NSI API), it's probably only by accident that UTF-8 works in the
console at all.

Unicode works best (though not perfectly) via the console's
wide-character API. The win-unicode-console package provides this
functionality for Python 2 and 3.
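
The split-write problem described above can be imitated in plain
Python, without a Windows console. This is only a rough sketch:
Python's UTF-8 decoder stands in for the console's transcoding to
UTF-16, and the helper names decode_stateless and decode_stateful are
made up for the illustration. The exact number of replacement
characters the real console shows may differ from what Python
produces.

```python
import codecs


def decode_stateless(chunks):
    """Decode every chunk on its own, keeping no state between writes.

    This mimics a consumer that forgets a partial multi-byte sequence
    as soon as one write ends.
    """
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


def decode_stateful(chunks):
    """Decode with an incremental decoder that carries partial sequences
    over from one chunk to the next."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    text = "".join(decoder.decode(chunk) for chunk in chunks)
    # Flush: anything still pending at the very end is a real error.
    return text + decoder.decode(b"", final=True)


# U+00E9 is 2 bytes in UTF-8 and U+20AC is 3 bytes, 5 bytes in total.
data = "\u00e9\u20ac".encode("utf-8")
# Pretend a buffered writer cut the output in the middle of U+20AC.
chunks = [data[:3], data[3:]]

print(ascii(decode_stateless(chunks)))  # contains U+FFFD replacements
print(ascii(decode_stateful(chunks)))   # round-trips the original text
```

A consumer that keeps decoder state across writes, which is roughly
what 8-bit POSIX terminals get for free, doesn't have this problem.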

> Currently, on my Win 10 system, 'chcp 65001' results in
> sys.stdout.encoding = 'cp65001', and
>
> for cp in 1200, 1201, 12000, 12001, 65000, 65001, 54936:
>     print(chr(cp))
> running without the usual exception. But of the above numbers
> mis-interpreted as codepoints, only 1200 and 1201 print anything other than
> a box with ?, whereas IDLE printed 3 other chars for 3 other assigned
> codepoints. If I change the console font to Lucida Console, which I use in
> IDLE, even chr(1200) gives a box.

65000 and 65001 aren't characters. Code points 12000, 12001 and 54936
are East-Asian characters:

    >>> from unicodedata import name, east_asian_width
    >>> for n in (12000, 12001, 54936):
    ...     c = chr(n)
    ...     print(n, east_asian_width(c), name(c))
    ...
    12000 W CJK RADICAL C-SIMPLIFIED EAT
    12001 W CJK RADICAL HEAD
    54936 W HANGUL SYLLABLE HOELS

The console window can't mix narrow glyphs with wide glyphs. Its font
rendering still has mostly the same limitations that it had when it
debuted in Windows NT 3.1 (1993). To display wide CJK glyphs in the
console, set the system locale to an East-Asian region and restart
Windows (what a piece of... cake).

The console also stores only one 16-bit wchar_t code per character
cell, so a UTF-16 surrogate pair representing a non-BMP character
(e.g. one of the popular emoji characters) displays as two rectangle
glyphs. However, at least the code values are preserved when copied
from the console to a window that displays UTF-16 text properly.

Alternatively, use ConEmu [1] to hide the original console and display
its contents in a window that handles text more flexibly. It also
hacks the console API via DLL injection to work around bugs and
provide Xterm emulation.

[1]: http://conemu.github.io
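
As a footnote to the surrogate-pair point above, here is a small
sketch that counts how many 16-bit code units a character needs in
UTF-16. The helper name utf16_units is made up for the illustration,
and U+1F600 is just an arbitrary non-BMP example. By the one-code-per-
cell rule described above, a character needing two units would take
two cells.

```python
def utf16_units(text):
    """Return the number of 16-bit code units needed to encode text in UTF-16."""
    # "utf-16-le" emits no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


# Three BMP characters from this thread plus one non-BMP character.
for ch in ("A", chr(12000), chr(54936), "\U0001F600"):
    print("U+%04X needs %d UTF-16 code unit(s)" % (ord(ch), utf16_units(ch)))
```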