[issue45120] Windows cp encodings "UNDEFINED" entries update

Eryk Sun Fri, 17 Sep 2021 15:34:37 -0700

Eryk Sun <eryk...@gmail.com> added the comment:

Rafael, I was discussing code_page_decode() and code_page_encode() both as an 
alternative for compatibility with other programs and as a way to explore how 
MultiByteToWideChar() and WideCharToMultiByte() work, particularly to explain 
the best-fit mappings, which do not roundtrip. Best fit applies only to 
encoding. MultiByteToWideChar() does not exhibit "best fit" behavior, and I 
don't even know what that would mean in the context of decoding.


With the exception of one change to code page 1255, the definitions that you're 
looking to add are just for the C1 controls and private use area codes, which 
are not meaningful. Windows uses these arbitrary definitions to be able to 
roundtrip between the system ANSI and Unicode APIs.
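To see which bytes are currently left undefined, here is a minimal sketch that 
probes one of Python's table-based single-byte codecs. The helper name 
undefined_bytes() is made up for illustration. It only makes sense for 
single-byte encodings, and the result depends on the Python version and on 
whether the mappings in question have been changed.

```python
def undefined_bytes(encoding):
    """Return the byte values that `encoding` refuses to decode in strict mode.

    Hypothetical helper for illustration; only meaningful for single-byte
    encodings such as cp1252 or cp1255.
    """
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


# Output depends on the codec tables shipped with the running Python.
print([hex(b) for b in undefined_bytes("cp1252")])
print([hex(b) for b in undefined_bytes("cp1255")])
```

This works on any platform because it uses the "cp1252" and "cp1255" codecs 
from the encodings package rather than the Windows code-page codec.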

Note that Python's "mbcs" (i.e. "ansi") and "oem" encodings use the code-page 
codec. For example:

    >>> import _winapi
    >>> _winapi.GetACP()
    1252

    >>> '\x81\x8d\x8f\x90\x9d'.encode('ansi')
    b'\x81\x8d\x8f\x90\x9d'

A best-fit encoding of "α" in code page 1252 [1]:

    >>> 'α'.encode('ansi', 'replace')
    b'a'
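The failure to roundtrip can be imitated without Windows. This is only a 
sketch: BEST_FIT_SAMPLE is a one-entry stand-in for the WCTABLE data in [1], 
and "bestfit_sample" is a made-up error handler name. It is not how 
code_page_encode() is implemented.

```python
import codecs

# Hypothetical one-entry excerpt standing in for a best-fit WCTABLE.
BEST_FIT_SAMPLE = {"α": "a"}


def best_fit_sample(exc):
    """Error handler that substitutes a best-fit character, else '?'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    chunk = exc.object[exc.start:exc.end]
    replacement = "".join(BEST_FIT_SAMPLE.get(ch, "?") for ch in chunk)
    return replacement, exc.end


codecs.register_error("bestfit_sample", best_fit_sample)

encoded = "α".encode("cp1252", "bestfit_sample")
decoded = encoded.decode("cp1252")
print(encoded, decoded, decoded == "α")
```

The encoded byte decodes to "a", not "α", so the original text is lost 
without any error being reported.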

In your PR, the change to code page 1255 to add b"\xca" <-> "\u05ba" is the 
only change that I think is really worthwhile because the unicode.org data has 
it wrong. You can get the proper character name for the comment using the 
unicodedata module:

    >>> import unicodedata
    >>> print(unicodedata.name('\u05ba'))
    HEBREW POINT HOLAM HASER FOR VAV

I'm +0 on leaving the mappings undefined where Windows completes legacy 
single-byte code pages with C1 control codes and private use area codes. It 
would have been fine if Python's code-page encodings had always been based on 
the "WindowsBestFit" tables, but using only the decoding table (MBTABLE), since 
that part is reasonable.

Ideally, I don't want anything to use the best-fit mappings in WCTABLE. I would 
rather that the 'replace' handler for code_page_encode() used the replacement 
character (U+FFFD) or system default character. But the world is not ideal; the 
system ANSI API uses the WCTABLE best-fit encoding. Back in the day with Python 
2.7, it was easy to demonstrate how insidious this is. For example, in 2.7.18:

    >>> import os
    >>> os.listdir(u'.')
    [u'\u03b1']

    >>> os.listdir('.')
    ['a']
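For contrast, Python's table-based "cp1252" codec does not consult a best-fit 
table, so on any platform it either fails or substitutes "?" for an unmappable 
character. A short sketch (the variable names are only for illustration):

```python
# Strict mode refuses the character outright.
try:
    "α".encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The 'replace' handler substitutes '?' instead of a look-alike letter.
replaced = "α".encode("cp1252", "replace")

print(strict_failed, replaced)
```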

---

[1] 
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue45120>
_______________________________________