New submission from STINNER Victor <vstin...@python.org>:

While working on bpo-46659, I found a bug in the "mbcs" alias of the encodings 
module. Even though the alias has 2 tests (in test_codecs and test_site), both 
tests missed the bug :-(

I fixed the alias with this change:
---
commit 04dd60e50cd3da48fd19cdab4c0e4cc600d6af30
Author: Victor Stinner <vstin...@python.org>
Date:   Sun Feb 6 21:50:09 2022 +0100

    bpo-46659: Update the test on the mbcs codec alias (GH-31168)
    
    encodings registers the _alias_mbcs() codec search function before
    the search_function() codec search function. Previously, the
    _alias_mbcs() was never used.
    
    Fix the test_codecs.test_mbcs_alias() test: use the current ANSI code
    page, not a fake ANSI code page number.
    
    Remove the test_site.test_aliasing_mbcs() test: the alias is now
    implemented in the encodings module, no longer in the site module.
---

But Eryk found two bugs:

"""


This was never true before. With 1252 as my ANSI code page, I checked 
codecs.lookup('cp1252') in 2.7, 3.4, 3.5, 3.6, 3.9, and 3.10, and none of them 
return the "mbcs" encoding. It's not equivalent, and not supposed to be. The 
implementation of "cp1252" should be cross-platform, regardless of whether 
we're on a Windows system with 1252 as the ANSI code page, as opposed to a 
Windows system with some other ANSI code page, or a Linux or macOS system.

The differences are that "mbcs" maps every byte, whereas our code-page 
encodings do not map undefined bytes, and the "replace" handler of "mbcs" uses 
a best-fit mapping (e.g. "α" -> "a") when encoding text, instead of mapping all 
undefined characters to "?".
"""
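
The "cp1252" half of this difference can be checked on any platform. A minimal 
sketch, assuming byte 0x81 as an example of a byte that the cross-platform 
"cp1252" codec leaves undefined; the "mbcs" half only runs on Windows, and its 
output depends on the ANSI code page, so nothing is asserted about it:

```python
import sys

# The cross-platform "cp1252" codec rejects bytes the code page leaves undefined.
try:
    b"\x81".decode("cp1252")
    undefined_byte_rejected = False
except UnicodeDecodeError:
    undefined_byte_rejected = True

# With the "replace" handler, an unencodable character becomes "?".
# "\u03b1" is GREEK SMALL LETTER ALPHA.
replaced = "\u03b1".encode("cp1252", "replace")

print(undefined_byte_rejected, replaced)

if sys.platform == "win32":
    # "mbcs" delegates to the Windows ANSI code page; results vary per system,
    # so they are only printed here for comparison.
    print(ascii(b"\x81".decode("mbcs", "replace")))
    print("\u03b1".encode("mbcs", "replace"))
```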

and my new test fails if the PYTHONUTF8=1 environment variable is set:

"""
This will fail if PYTHONUTF8 is set in the environment, because it overrides 
getpreferredencoding(False) and _get_locale_encoding().
"""
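
A small sketch of why the test is sensitive to UTF-8 mode. The helper name 
locale_encoding_is_utf8() is hypothetical and only used for illustration; it 
relies on sys.flags.utf8_mode and locale.getpreferredencoding(False):

```python
import codecs
import locale
import sys


def locale_encoding_is_utf8():
    """Return True if getpreferredencoding(False) currently reports UTF-8."""
    enc = locale.getpreferredencoding(False)
    # Normalize through the codec registry so "UTF-8" and "utf-8" compare equal.
    return codecs.lookup(enc).name == "utf-8"


if sys.flags.utf8_mode:
    # With PYTHONUTF8=1 (or -X utf8), the reported encoding is UTF-8, so it
    # cannot be equal to a "cpXXX" ANSI code page name.
    assert locale_encoding_is_utf8()

print(sys.flags.utf8_mode, locale.getpreferredencoding(False))
```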

The code for the "mbcs" alias changed a lot between Python 3.5 and 3.7.

In Python 3.5, site module:
---
def aliasmbcs():
    """On Windows, some default encodings are not provided by Python,
    while they are always available as "mbcs" in each locale. Make
    them usable by aliasing to "mbcs" in such a case."""
    if sys.platform == 'win32':
        import _bootlocale, codecs
        enc = _bootlocale.getpreferredencoding(False)
        if enc.startswith('cp'):            # "cp***" ?
            try:
                codecs.lookup(enc)
            except LookupError:
                import encodings
                encodings._cache[enc] = encodings._unknown
                encodings.aliases.aliases[enc] = 'mbcs'
---

In Python 3.6, encodings module:
---
(...)
codecs.register(search_function)

if sys.platform == 'win32':
    def _alias_mbcs(encoding):
        try:
            import _bootlocale
            if encoding == _bootlocale.getpreferredencoding(False):
                import encodings.mbcs
                return encodings.mbcs.getregentry()
        except ImportError:
            # Imports may fail while we are shutting down
            pass

    codecs.register(_alias_mbcs)
---

In Python 3.7, encodings module:
---
(...)
codecs.register(search_function)

if sys.platform == 'win32':
    def _alias_mbcs(encoding):
        try:
            import _winapi
            ansi_code_page = "cp%s" % _winapi.GetACP()
            if encoding == ansi_code_page:
                import encodings.mbcs
                return encodings.mbcs.getregentry()
        except ImportError:
            # Imports may fail while we are shutting down
            pass

    codecs.register(_alias_mbcs)
---

In Python 3.6 and 3.7, "codecs.register(_alias_mbcs)" has no effect because 
"search_function()" is registered first, so it is tried first, and it succeeds 
for the "cpXXX" encodings that Python implements. My change reverses the order 
in which the codec search functions are registered: first the MBCS alias, then 
the encodings search_function().
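
The effect of registration order can be reproduced without Windows. This is 
only a sketch: the codec name "demo_hypothetical_codec" and the two search 
functions are made up for illustration and are not part of the encodings 
module. Both functions claim the same name, and the one registered first wins:

```python
import codecs

calls = []


def first(name):
    # Registered first: answers for the made-up name with the UTF-8 codec.
    if name == "demo_hypothetical_codec":
        calls.append("first")
        return codecs.lookup("utf-8")
    return None


def second(name):
    # Registered second: would answer with Latin-1, but is never reached.
    if name == "demo_hypothetical_codec":
        calls.append("second")
        return codecs.lookup("latin-1")
    return None


codecs.register(first)
codecs.register(second)

# The lookup stops at the first search function that returns a CodecInfo.
info = codecs.lookup("demo_hypothetical_codec")
print(info.name, calls)
```

Note that codecs.lookup() caches its result, so a later registration cannot 
change the answer for a name that has already been looked up.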

In Python 3.5, the alias was only created if Python didn't support the code 
page.

----------
components: Library (Lib)
messages: 412678
nosy: vstinner
priority: normal
severity: normal
status: open
title: encodings: the "mbcs" alias doesn't work
versions: Python 3.11

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue46668>
_______________________________________