Eryk Sun <eryk...@gmail.com> added the comment:
locale.normalize is generally wrong in Windows. It's meant for POSIX systems. Currently "tr_TR" is parsed as follows:

    >>> import locale
    >>> locale._parse_localename('tr_TR')
    ('tr_TR', 'ISO8859-9')

The encoding "ISO8859-9" is meaningless to Windows. Also, the old CRT only ever supported either full language/country names or non-standard abbreviations -- e.g. either "Turkish_Turkey" or "trk_TUR". Having locale.getdefaultlocale() return ISO two-letter codes (e.g. "en_GB") was fundamentally wrong for the old CRT. (2.7 will die with this wart.)

3.5+ uses the Universal CRT, which does support standard ISO codes, but only in BCP 47 [1] locale names of the following form:

    language           ISO 639
    ["-" script]       ISO 15924
    ["-" region]       ISO 3166-1

BCP 47 locale names have been preferred by Windows for the past 13 years, since Vista was released. Windows extends BCP 47 with a non-standard sort-order field (e.g. "de-Latn-DE_phoneb" is the German language with Latin script in the region of Germany with phone-book sort order). Another departure from strict BCP 47 in Windows is allowing underscore to be used as the delimiter instead of hyphen.

In a concession to existing C code, the Universal CRT also supports an encoding suffix in BCP 47 locales, but this can only be either ".utf-8" or ".utf8". No other encoding is allowed. (Windows itself does not support specifying an encoding in a locale name, but it's Unicode anyway.) If ".utf-8" isn't specified, a BCP 47 locale defaults to the locale's ANSI codepage. However, there's no way to convey this in the locale name itself. Also, if a locale is Unicode only (e.g. Hindi), the CRT implicitly uses UTF-8 even without the ".utf-8" suffix.

The following are valid BCP 47 locale names in the CRT: "tr", "tr.utf-8", "tr-TR", "tr_TR", "tr_TR.utf8", or "tr-Latn-TR.utf-8". But note that "tr_TR.1254" is not supported.

The following shows that omitting the optional "utf-8" encoding in a BCP 47 locale makes the CRT default to the associated ANSI codepage.

    >>> import ctypes, locale
    >>> ucrt = ctypes.CDLL('ucrtbase')
    >>> kernel32 = ctypes.WinDLL('kernel32')
    >>> LOCALE_IDEFAULTANSICODEPAGE = 0x1004
    >>> locale.setlocale(locale.LC_CTYPE, 'tr_TR')
    'tr_TR'
    >>> ucrt.___lc_codepage_func()
    1254

The C function ___lc_codepage_func() queries the codepage of the current locale. We can directly query this codepage for a BCP 47 locale via GetLocaleInfoEx:

    >>> cpstr = (ctypes.c_wchar * 6)()
    >>> kernel32.GetLocaleInfoEx('tr-TR',
    ...     LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr))
    5
    >>> cpstr.value
    '1254'

If the result is '0', it's a Unicode-only locale (e.g. 'hi-IN' -- Hindi, India). Recent versions of the CRT use UTF-8 (codepage 65001) for Unicode-only locales:

    >>> locale.setlocale(locale.LC_CTYPE, 'hi-IN')
    'hi-IN'
    >>> ucrt.___lc_codepage_func()
    65001

Here are some example locale tuples that should be supported, given that the CRT continues to support full English locale names and non-standard abbreviations, in addition to the new BCP 47 names:

    ('tr', None)
    ('tr_TR', None)
    ('tr_Latn_TR', None)
    ('tr_TR', 'utf-8')
    ('trk_TUR', '1254')
    ('Turkish_Turkey', '1254')

The return value from C setlocale can be normalized to replace hyphen delimiters with underscores, and "utf8" can be normalized as "utf-8". If it's a BCP 47 locale that has no encoding, GetLocaleInfoEx can be called to query the ANSI codepage. UTF-8 can be assumed if it's a Unicode-only locale.

As to prefixing a codepage with 'cp', we don't really need to do this. We have aliases defined for most, such as '1252' -> 'cp1252'. But if the 'cp' prefix does get added, then the locale module should at least know to remove it when building a locale name from a tuple.

[1] https://tools.ietf.org/rfc/bcp/bcp47.txt

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37945>
_______________________________________
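
The normalization suggested in the comment (hyphens to underscores, "utf8" to "utf-8", and dropping a 'cp' prefix when building a name from a tuple) could be sketched roughly as follows. This is only an illustration, not the locale module's code: the names split_crt_locale and join_crt_locale are hypothetical, the encoding is assumed to follow the first "." in the name, and the GetLocaleInfoEx lookup for names that carry no encoding is left out so the sketch stays pure Python.

```python
def split_crt_locale(name):
    """Split a C setlocale() result into a (language, encoding) tuple.

    Hypothetical helper: assumes the encoding, if any, follows the
    first '.' in the name.
    """
    lang, sep, enc = name.partition('.')
    # Windows accepts either delimiter; use underscores in the tuple.
    lang = lang.replace('-', '_')
    if not sep:
        # No encoding in the name. A fuller version could query the
        # ANSI codepage via GetLocaleInfoEx here.
        return lang, None
    if enc.lower() in ('utf8', 'utf-8'):
        enc = 'utf-8'
    return lang, enc


def join_crt_locale(localetuple):
    """Build a CRT locale name from a (language, encoding) tuple.

    Hypothetical helper: strips a 'cp' prefix, since the CRT expects
    a bare codepage number such as '1254'.
    """
    lang, enc = localetuple
    if enc is None:
        return lang
    if enc.lower().startswith('cp') and enc[2:].isdigit():
        enc = enc[2:]
    return f'{lang}.{enc}'
```

With these definitions, split_crt_locale('tr-Latn-TR.utf8') gives ('tr_Latn_TR', 'utf-8'), and join_crt_locale(('Turkish_Turkey', 'cp1254')) gives 'Turkish_Turkey.1254'.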