On Fri, 16 Dec 2022 at 19:21:37 +0100, Adam Borowski wrote:
> As of Bookworm, legacy locales are no longer officially supported.
For clarity, I think when you say "legacy locales" you mean locales whose
character encoding is either explicitly or implicitly something other
than UTF-8 ("legacy national encodings"), like en_US (implicitly
ISO-8859-1 according to /usr/share/i18n/SUPPORTED) and en_GB.ISO-8859-15
(explicitly ISO-8859-15 in its name). True?

Many of the non-UTF-8 encodings are single-byte encodings in the
ISO-8859 family, but if I understand correctly, your reasoning applies
equally to multi-byte east Asian encodings like BIG5, GB18030 and
EUC-JP. Also true?

Meanwhile, locales with a UTF-8 character encoding, like en_AG
(implicitly UTF-8 according to /usr/share/i18n/SUPPORTED) or en_US.UTF-8
(explicitly UTF-8), are the ones you are considering to be non-legacy.
Also true?

I think for Policy use, this would have to say something more precise,
like "locales with a non-UTF-8 character encoding". I wouldn't want to
get en_US speakers trying to argue that en_GB.UTF-8 is a legacy locale,
or en_GB speakers like me trying to argue that en_US.UTF-8 is a legacy
locale :-)

When you say "officially supported" here, do you refer to the extent to
which they are supported by the glibc maintainers, or some other group?
Or are you describing a change request that they *should not* be
officially supported by Debian - something that is not necessarily true
yet, but in this bug you are asking for it to become true?

> * Software may assume they always run in an UTF-8 locale, and emit or
> require UTF-8 input/output without checking.

I suspect this is already common: for example, ikiwiki is strictly
UTF-8-only and ignores locales' character sets, which is arguably a bug
right now but would become a non-bug with your proposed policy. This is
a "may", so it can't possibly make a package gain bugs; it might make
packages have fewer bugs.

> * The execution environment (usually init system or a container) must
> default to UTF-8 encoding unless explicitly configured otherwise.

Is this already true?
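To make the distinction I'm asking about concrete, here's a rough sketch
(not real Debian tooling, and the classification rule is my assumption,
not anything Policy says yet) of "legacy" as "character encoding other
than UTF-8", applied to entries in the style of /usr/share/i18n/SUPPORTED
("name encoding" pairs):

```python
# Hypothetical sketch: classify SUPPORTED-style "name encoding" lines
# under the assumed rule "legacy" == "character encoding is not UTF-8".
def is_legacy(supported_line: str) -> bool:
    name, _, encoding = supported_line.partition(" ")
    return encoding.strip() != "UTF-8"

examples = [
    "en_US ISO-8859-1",            # implicitly ISO-8859-1 -> legacy
    "en_GB.ISO-8859-15 ISO-8859-15",  # explicitly non-UTF-8 -> legacy
    "ja_JP.EUC-JP EUC-JP",         # multi-byte east Asian -> legacy too
    "en_AG UTF-8",                 # implicitly UTF-8 -> non-legacy
    "en_US.UTF-8 UTF-8",           # explicitly UTF-8 -> non-legacy
]
for line in examples:
    print(line, "->", "legacy" if is_legacy(line) else "non-legacy")
```

Note that under this rule the locale *name* is irrelevant; only the
encoding column matters, which is why I think Policy wording should talk
about encodings rather than "legacy locales".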
This seems like the sort of thing which should be fixed in at least the
major init systems and container managers before it goes into Policy, in
the interests of not making those init systems and container managers
retroactively buggy.

> * Legacy locales are no longer officially supported, and packages may
> drop support for them and/or exclude them from their testsuites.
> * Packages may retain support for legacy locales, but related bug reports
> (unless security related) are considered to be of wishlist severity.

Is the C (aka POSIX) locale still a non-UTF-8 locale (if I remember
correctly, its character encoding is officially 7-bit ASCII), or has it
been redefined to be UTF-8? Given the special status of the C locale in
defaults and standards, it might be necessary to say that it's the only
supported locale with a non-UTF-8 character encoding.

> * Filesystems may be configured to reject file names that are not valid
> printable UTF-8 encoded Unicode.

To put this in terms of the requirements that Policy puts on packages,
is this really a should/must in disguise: packages should/must not
assume that they can successfully read/write filenames that are not
valid printable UTF-8-encoded Unicode?

This seems like a change with a wider scope: not only is it excluding
filenames in Latin-1 or whatever, it's also excluding filenames with
non-printable characters (tabs, control characters etc.), or with the
UTF-8 representation of a noncharacter like U+FDEF. Perhaps that should
be a change orthogonal to de-supporting the non-UTF-8 locales?

> * Human-readable files outside of packages' private data must be encoded
> in UTF-8. This applies especially to files in /usr/share/doc and /etc
> but applies to eg. executable scripts in /bin or /sbin as well.

It's not immediately obvious to me what "human-readable files" means
here. Text files? Text files in ASCII-compatible encodings? Files
intended to be read and written by standard text editors?
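To illustrate how much wider "valid printable UTF-8" is than just
"UTF-8", here is a sketch of the kind of check I understand the filename
bullet to imply. The exact definition of "printable" is my assumption
(here: no characters in the Unicode "C" categories, and no
noncharacters), which is precisely the part I think Policy would need to
pin down:

```python
import unicodedata

def is_acceptable_filename(raw: bytes) -> bool:
    """Hypothetical check for the proposed rule: filename bytes must
    decode as UTF-8 and contain only printable characters (no controls,
    no noncharacters such as U+FDEF)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # e.g. Latin-1 or EUC-JP bytes
    for ch in text:
        # Category "C*" covers control, format, surrogate and
        # unassigned characters - none of which are "printable".
        if unicodedata.category(ch).startswith("C"):
            return False
        cp = ord(ch)
        # Unicode noncharacters: U+FDD0..U+FDEF and U+xxFFFE/U+xxFFFF.
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return False
    return True
```

Under this sketch, UTF-8 "café.txt" passes, Latin-1 "café.txt" fails
(invalid UTF-8), and a name containing a tab fails even though it *is*
valid UTF-8 - which is why I think the printability requirement is a
separate change from de-supporting non-UTF-8 locales.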
I assume the intention here is to make it a policy violation to ship
documentation, scripts, configuration files, etc. encoded in something
like ISO-8859-1 or EUC-JP? Is this also intended to make it a policy
violation to ship documentation, etc. encoded in UTF-16?

> * So-called BOM (U+FEFF) must not be added to plain-text output, and if
> present, editors/viewers customarily used for editing code should not
> hide its presence.

This seems to me like it should perhaps be out of scope here, and
treated as a separate change: UTF-8 is still UTF-8 whether it starts
with U+FEFF or not, and I think deprecating en_GB in favour of
en_GB.UTF-8 (and so on) is orthogonal to deprecating the use of a U+FEFF
prefix on UTF-8 text.

I think "UTF-8 output" is probably a better scope for this than
"plain-text output": my understanding is that when emitting UTF-16,
UCS-2 or UCS-4, it's conventional (perhaps even recommended?) to emit a
BOM first, because in those encodings of Unicode either LE or BE byte
order is reasonable (unlike UTF-8, which is always MSB-first by design).

Perhaps you meant this to be implicit, because to a Unix developer,
"plain text" is implicitly something ASCII-compatible (which rules out
every Unicode encoding except UTF-8), and legacy national encodings
cannot represent U+FEFF (which rules those out), leaving UTF-8 as the
only "plain text" encoding where U+FEFF is even representable?

It seems to me that it shouldn't be a Policy violation for things like
text editors and character set converters to have the option to emit
UTF-8 with a U+FEFF prefix, but maybe it should instead be a Policy
violation for that to be the default.

    smcv