Eryk Sun <eryk...@gmail.com> added the comment:

locale.normalize is generally wrong in Windows. It's meant for POSIX systems. 
Currently "tr_TR" is parsed as follows:

    >>> import locale
    >>> locale._parse_localename('tr_TR')
    ('tr_TR', 'ISO8859-9')

The encoding "ISO8859-9" is meaningless to Windows. Also, the old CRT only ever 
supported either full language/country names or non-standard abbreviations -- 
e.g. either "Turkish_Turkey" or "trk_TUR". Having locale.getdefaultlocale() 
return ISO two-letter codes (e.g. "en_GB") was fundamentally wrong for the old 
CRT. (2.7 will die with this wart.)

3.5+ uses the Universal CRT, which does support standard ISO codes, but only in 
BCP 47 [1] locale names of the following form:

    language           ISO 639
    ["-" script]       ISO 15924
    ["-" region]       ISO 3166-1

BCP 47 locale names have been preferred by Windows for the past 13 years, since 
Vista was released. Windows extends BCP 47 with a non-standard sort-order field 
(e.g. "de-Latn-DE_phoneb" is the German language with Latin script in the 
region of Germany with phone-book sort order). Another departure from strict 
BCP 47 in Windows is allowing underscore to be used as the delimiter instead of 
hyphen. 
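
The name form described above (language, optional script, optional region, 
plus the Windows sort-order field and the underscore delimiter) can be 
approximated with a regular expression. This is only a syntax sketch, not the 
CRT's or Windows' actual parser. The pattern and the helper name 
parse_windows_bcp47 are hypothetical, the region is limited to two-letter 
ISO 3166-1 codes, and the sort-order field is assumed to appear only after a 
region, as in "de-Latn-DE_phoneb".

```python
import re

# Hypothetical approximation of the Windows flavor of BCP 47 locale names.
# It checks syntax only; it does not check that the locale actually exists.
_WINDOWS_BCP47 = re.compile(
    r"""
    (?P<language>[A-Za-z]{2,3})        # ISO 639 language code
    (?:[-_](?P<script>[A-Za-z]{4}))?   # optional ISO 15924 script
    (?:
        [-_](?P<region>[A-Za-z]{2})    # optional ISO 3166-1 alpha-2 region
        (?:_(?P<sort>[A-Za-z0-9]+))?   # sort order (assumed to follow a region)
    )?
    """,
    re.VERBOSE,
)


def parse_windows_bcp47(name):
    """Return the subtags as a dict, or None if the syntax doesn't match."""
    match = _WINDOWS_BCP47.fullmatch(name)
    return match.groupdict() if match else None
```

With this pattern, "tr", "tr-TR", "tr_TR" and "tr-Latn-TR" all parse, while 
legacy names such as "Turkish_Turkey" and "trk_TUR" are rejected.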

In a concession to existing C code, the Universal CRT also supports an encoding 
suffix in BCP 47 locales, but this can only be either ".utf-8" or ".utf8". 
(Windows itself does not support specifying an encoding in a locale name, but 
it's Unicode anyway.) No other encoding is allowed. If ".utf-8" isn't 
specified, a BCP 47 locale defaults to the locale's ANSI codepage. However, 
there's no way to convey this in the locale name itself. Also, if a locale is 
Unicode only (e.g. Hindi), the CRT implicitly uses UTF-8 even without the 
".utf-8" suffix.

The following are valid BCP 47 locale names in the CRT: "tr", "tr.utf-8", 
"tr-TR", "tr_TR", "tr_TR.utf8", or "tr-Latn-TR.utf-8". But note that 
"tr_TR.1254" is not supported.

The following shows that omitting the optional "utf-8" encoding in a BCP 47 
locale makes the CRT default to the associated ANSI codepage. 

    >>> import ctypes
    >>> ucrt = ctypes.CDLL('ucrtbase')
    >>> locale.setlocale(locale.LC_CTYPE, 'tr_TR')
    'tr_TR'
    >>> ucrt.___lc_codepage_func()
    1254

The C function ___lc_codepage_func() returns the codepage of the current 
locale. We can query this codepage directly for a BCP 47 locale via 
GetLocaleInfoEx:

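    >>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    >>> LOCALE_IDEFAULTANSICODEPAGE = 0x1004
    >>> cpstr = (ctypes.c_wchar * 6)()
    >>> kernel32.GetLocaleInfoEx('tr-TR',
    ...     LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr))
    5
    >>> cpstr.value
    '1254'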

If the result is '0', it's a Unicode-only locale (e.g. 'hi-IN' -- Hindi, 
India). Recent versions of the CRT use UTF-8 (codepage 65001) for Unicode-only 
locales:

    >>> locale.setlocale(locale.LC_CTYPE, 'hi-IN')
    'hi-IN'
    >>> ucrt.___lc_codepage_func()
    65001

Here are some example locale tuples that should be supported, given that the 
CRT continues to support full English locale names and non-standard 
abbreviations, in addition to the new BCP 47 names:

    ('tr', None)
    ('tr_TR', None)
    ('tr_Latn_TR', None)
    ('tr_TR', 'utf-8')
    
    ('trk_TUR', '1254')
    ('Turkish_Turkey', '1254')

The return value from C setlocale can be normalized to replace hyphen 
delimiters with underscores, and "utf8" can be normalized as "utf-8". If it's a 
BCP 47 locale that has no encoding, GetLocaleInfoEx can be called to query the 
ANSI codepage. UTF-8 can be assumed if it's a Unicode-only locale. 
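
A minimal sketch of that normalization follows. The function name 
split_crt_locale and its ansi_codepage parameter are hypothetical; the latter 
stands in for a GetLocaleInfoEx wrapper and receives the name exactly as 
setlocale returned it. Deciding that a name is BCP 47 because it starts with a 
two- or three-letter subtag is also an assumption made for this example.

```python
import re

# Assumption: a name whose first subtag is 2-3 letters is a BCP 47 name.
_BCP47_START = re.compile(r'[A-Za-z]{2,3}(?:[-_]|$)')


def split_crt_locale(result, ansi_codepage=None):
    """Split a C setlocale() result into a (name, encoding) tuple.

    ansi_codepage is an optional callable mapping a BCP 47 name to its ANSI
    codepage string; '0' marks a Unicode-only locale.
    """
    name, _, encoding = result.partition('.')
    if encoding.lower() in ('utf8', 'utf-8'):
        encoding = 'utf-8'
    if _BCP47_START.match(name):
        if not encoding and ansi_codepage is not None:
            codepage = ansi_codepage(name)
            # UTF-8 is assumed for Unicode-only locales.
            encoding = 'utf-8' if codepage == '0' else codepage
        # Only BCP 47 names get their hyphens rewritten; legacy full
        # names are returned untouched.
        name = name.replace('-', '_')
    return name, encoding or None
```

Without a lookup callable, split_crt_locale('tr-TR') gives ('tr_TR', None) and 
split_crt_locale('tr-Latn-TR.utf8') gives ('tr_Latn_TR', 'utf-8'), matching 
the tuples listed above.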

As to prefixing a codepage with 'cp', we don't really need to do this. We have 
aliases defined for most, such as '1252' -> 'cp1252'. But if the 'cp' prefix 
does get added, then the locale module should at least know to remove it when 
building a locale name from a tuple.
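
As a rough illustration of that last point, building a locale name from a 
tuple could strip the prefix as shown here. The function name 
build_crt_localename is hypothetical and this is not the locale module's 
current behavior.

```python
def build_crt_localename(localetuple):
    """Join a (name, encoding) tuple into a CRT locale name."""
    name, encoding = localetuple
    if encoding is None:
        return name
    lowered = encoding.lower()
    # Drop a 'cp' prefix so that 'cp1254' becomes the bare codepage '1254'.
    if lowered.startswith('cp') and lowered[2:].isdigit():
        encoding = lowered[2:]
    return f'{name}.{encoding}'
```

For example, ('Turkish_Turkey', 'cp1254') becomes 'Turkish_Turkey.1254'. The 
sketch does not guard against pairing a BCP 47 name with a codepage, which 
would still produce an unsupported name such as 'tr_TR.1254'.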

[1] https://tools.ietf.org/rfc/bcp/bcp47.txt

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37945>
_______________________________________