Eryk Sun <eryk...@gmail.com> added the comment:

> On Windows 10 (version 1903), ANSI code page 1252, OEM code page 437, 
> LC_CTYPE locale "French_France.1252"

The CRT default locale (i.e. the empty locale "") uses the user locale, which 
is the "Format" value on the Region->Formats tab. It does not use the system 
locale from the Region->Administrative tab. 

The default locale normally uses the user locale's ANSI codepage, as returned 
by GetLocaleInfoEx(LOCALE_NAME_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...). 
But if the active codepage of the process is UTF-8, then GetACP(), GetOEMCP(), 
and setlocale(LC_CTYPE, "") all use UTF-8 (i.e. CP_UTF8, i.e. 65001). The 
active codepage can be set to UTF-8 either at the system-locale level or in the 
application manifest. For example, with the active codepage setting in the 
manifest:

    C:\>python.utf8.exe -q

    >>> from locale import setlocale, LC_CTYPE
    >>> setlocale(LC_CTYPE, "")
    'English_Canada.utf8'

    >>> import ctypes
    >>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    >>> kernel32.GetACP()
    65001
    >>> kernel32.GetOEMCP()
    65001

A default locale name can also specify the codepage to use. It could be UTF-8, 
a particular codepage, ".ACP" (ANSI), or ".OCP" (OEM). "ACP" and "OCP" have to 
be in upper case. For example:

    >>> setlocale(LC_CTYPE, '.utf8')
    'English_Canada.utf8'
    >>> setlocale(LC_CTYPE, '.437')
    'English_Canada.437'

    >>> setlocale(LC_CTYPE, ".ACP")
    'English_Canada.1252'
    >>> setlocale(LC_CTYPE, ".OCP")
    'English_Canada.850'

Otherwise, if you provide a known locale (using full names, three-letter 
abbreviations, or one of the small set of locale aliases), then setlocale 
queries any missing values from the NLS database. 

One snag in the road is the set of Unicode-only locales, such as "Hindi_India". 
Querying the ANSI and OEM codepages for a Unicode-only locale respectively 
returns CP_ACP (0) and CP_OEMCP (1). It used to be that the CRT would end up 
using the system locale for these cases. But recently ucrt has switched to 
using UTF-8 for these cases. For example:

    >>> setlocale(LC_CTYPE, "Hindi_India")
    'Hindi_India.utf8'

That brings us to the case of modern Windows BCP-47 locale names, which usually 
lack an explicit encoding. For example:

    >>> setlocale(LC_CTYPE, "hi_IN")
    'hi_IN'

The current CRT codepage can be queried via ___lc_codepage_func (note the three 
leading underscores):

    >>> import ctypes; ucrt = ctypes.CDLL('ucrtbase', use_errno=True)
    >>> ucrt.___lc_codepage_func()
    65001

With the exception of Unicode-only locales, using a modern name without an 
encoding defaults to the named locale's ANSI codepage. For example:

    >>> setlocale(LC_CTYPE, "en_CA")
    'en_CA'
    >>> ucrt.___lc_codepage_func()
    1252
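
The codepage query can be packaged as a helper with an explicit prototype. A 
sketch only: get_crt_codepage is a hypothetical name, and the result is only 
meaningful for a Python build that links against ucrtbase. Off Windows it 
returns None.

```python
import ctypes
import sys


def get_crt_codepage():
    """Return the codepage of the current CRT LC_CTYPE locale, or None off Windows."""
    if sys.platform != "win32":
        return None
    ucrt = ctypes.CDLL("ucrtbase", use_errno=True)
    func = ucrt.___lc_codepage_func  # three leading underscores
    func.restype = ctypes.c_uint
    func.argtypes = ()
    return func()
```

Calling it right after setlocale() shows which codepage the CRT actually picked 
for a name that carries no encoding.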

The only encoding allowed in BCP-47 locale names is ".utf8" or ".utf-8" (case 
insensitive):

    >>> setlocale(LC_CTYPE, "fr_FR.utf8")
    'fr_FR.utf8'
    >>> setlocale(LC_CTYPE, "fr_FR.UTF-8")
    'fr_FR.UTF-8'

No other encoding is allowed with this form. For example:

    >>> try: setlocale(LC_CTYPE, "fr_FR.ACP")
    ... except Exception as e: print(e)
    ...
    unsupported locale setting
    >>> try: setlocale(LC_CTYPE, "fr_FR.1252")
    ... except Exception as e: print(e)
    ...
    unsupported locale setting
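
When probing many names like this, a helper that reports failure as None and 
restores the previous locale is convenient. This is a sketch with a 
hypothetical name (try_setlocale); it uses only the documented locale module 
API.

```python
import locale


def try_setlocale(name, category=locale.LC_CTYPE):
    """Return the name setlocale() reports for `name`, or None if unsupported.

    The locale that was active before the call is restored afterwards.
    """
    previous = locale.setlocale(category)  # query only, no change
    try:
        return locale.setlocale(category, name)
    except locale.Error:
        return None
    finally:
        locale.setlocale(category, previous)
```

With this, try_setlocale("fr_FR.1252") can be compared against 
try_setlocale("fr_FR.utf8") without leaving the process in a changed locale.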

As to the "tr_TR" locale bug, the Windows implementation is broken due to 
assumptions that POSIX locale names are directly supported. A significant 
redesign is required to connect the dots.

    >>> from locale import getlocale
    >>> setlocale(LC_CTYPE, 'tr_TR')
    'tr_TR'
    >>> ucrt.___lc_codepage_func()
    1254

    >>> getlocale(LC_CTYPE)
    ('tr_TR', 'ISO8859-9')

Codepage 1254 is similar to ISO8859-9, except, in typical fashion, Microsoft 
assigned most of the upper control range 0x80-0x9F to an assortment of 
characters it deemed useful, such as the Euro symbol "€". The exact codepage 
needs to be queried via ___lc_codepage_func() and returned as ('tr_TR', 
'cp1254'). 
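
The difference between the two encodings is easy to see with Python's own 
codecs, and mapping a CRT codepage to a codec name is a one-liner. The helper 
name encoding_from_codepage is hypothetical; it sketches what getlocale() could 
return, not what it currently does.

```python
def encoding_from_codepage(codepage):
    """Map a Windows codepage number to a Python codec name."""
    if codepage == 65001:  # CP_UTF8
        return "utf-8"
    return f"cp{codepage}"


# Byte 0x80 is the Euro sign in cp1254 but a C1 control character in ISO8859-9.
euro_in_cp1254 = b"\x80".decode("cp1254")
control_in_latin5 = b"\x80".decode("iso8859-9")

# The Turkish letters themselves agree, e.g. 0xDD is dotted capital I in both.
dotted_i_cp1254 = b"\xdd".decode("cp1254")
dotted_i_latin5 = b"\xdd".decode("iso8859-9")
```

So text limited to the Turkish letters round-trips identically in both, while 
anything in 0x80-0x9F does not.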

Conversely, setlocale() needs to know that this BCP-47 name does not support an 
explicit encoding, unless it's "utf8". If the given codepage, or an associated 
alias, doesn't match the locale's ANSI codepage, then the locale name has to be 
expanded to the full name "Turkish_Turkey". The long name allows specifying an 
arbitrary codepage. 

For example, say we have ('tr_TR', 'ISO8859-7'), i.e. Greek with Turkish locale 
rules. This transforms to the closest approximation ('tr_TR', '1253'). When 
setlocale queries the OS, it will find that the ANSI codepage is actually 1254, 
so it cannot use "tr_TR" or "tr-TR". It needs to expand to the long form:

    >>> setlocale(LC_CTYPE, 'Turkish_Turkey.1253')
    'Turkish_Turkey.1253'

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38324>
_______________________________________