Re: [Python-ideas] PEP 540: Add a new UTF-8 mode

Petr Viktorin Wed, 11 Jan 2017 03:23:24 -0800

On 01/11/2017 11:46 AM, Stephan Houben wrote:

Hi INADA Naoki,


(Sorry, I am unsure if INADA or Naoki is your first name...)

While I am very much in favour of everything working "out of the box",
an issue is that we don't have control over external code
(be it Python extensions or external processes invoked from Python).

And that code will only look at LANG/LC_CTYPE and ignore any cleverness
we build into Python.

For example, this may mean that a built-in Python string sort will give you
a different ordering than invoking the external "sort" command.
I have been bitten by this kind of issue, leading to spurious "diffs" if
you try to use sorting to put strings into a canonical order.

AFAIK, this would not be a problem under PEP 538, which effectively
treats the "C" locale as "C.UTF-8". Strings of Unicode codepoints and
the corresponding UTF-8-encoded bytes sort the same way.
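
To illustrate the sorting claim, here is a minimal Python sketch. The word list is an arbitrary assumption chosen to mix ASCII, accented Latin, CJK, and a non-BMP character; it is not taken from the thread. It compares Python's code-point ordering of str with a byte-wise ordering of the UTF-8 encodings.

```python
# Arbitrary sample data (an assumption for illustration only).
words = ["zebra", "Ägypten", "apple", "éclair", "日本", "Zoo", "😀", "\uffff"]

# Python compares str values code point by code point.
by_codepoint = sorted(words)

# Sort the same strings by their UTF-8 bytes, compared byte by byte,
# which is what a byte-oriented tool would do.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

same_order = by_codepoint == by_utf8_bytes
print(same_order)
```

The two orderings agree because UTF-8 was designed so that byte-wise comparison preserves code-point order.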

Is that wrong, or do you have a better example of trouble with using
"C.UTF-8" instead of "C"?

So my feeling is that people are ultimately not being helped by
Python trying to be "nice", since they will be bitten by locale issues
anyway. IMHO ultimately better to educate them to configure the locale.
(I realise that people may reasonably disagree with this assessment ;-) )

I would then recommend setting the locale to en_US.UTF-8, which is
slower and less elegant but at least more widely supported.

What about the spurious diffs you'd get when switching from "C" to
"en_US.UTF-8"?


$ LC_ALL=en_US.UTF-8 sort file.txt
a
a
A
A
$ LC_ALL=C sort file.txt
A
A
a
a
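
The same difference can be sketched from Python. This is a hedged illustration: it assumes the four lines of file.txt are the ones shown in the output above, and the locale-aware branch only runs if an "en_US.UTF-8" locale happens to be installed on the machine.

```python
import locale

# Assumed contents of file.txt, based on the sort output shown above.
lines = ["a", "A", "a", "A"]

# Plain sorted() orders by code point, like LC_ALL=C sort:
# uppercase ASCII letters come before lowercase ones.
codepoint_order = sorted(lines)
print(codepoint_order)

# Locale-aware collation, like LC_ALL=en_US.UTF-8 sort.
# The result depends on the platform's locale data, so none is asserted here.
try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    print(sorted(lines, key=locale.strxfrm))
except locale.Error:
    print("en_US.UTF-8 is not available on this system")
```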

By the way, I know a bit about how Node.js deals with locales, and it doesn't try
to compensate for "C" locales either. But what it *does* do is that
Node never uses the locale settings to determine the encoding of a file:
you either have to specify it explicitly OR it defaults to UTF-8 (the
latter on output only).
So in this respect it is by specification immune to
misconfiguration of the encoding.
However, other stuff (e.g. date formatting) will still be influenced by
the "C" locale
as usual.
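
For comparison, the same locale-independent behaviour is available in Python today by always passing an explicit encoding to open(). A minimal sketch follows; the file name and sample text are made up for illustration.

```python
import os
import tempfile

# Sample text with non-ASCII characters (an assumption for illustration).
text = "naïve café ☕"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")  # hypothetical file name

    # An explicit encoding means the locale is never consulted.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

    # Read the raw bytes to confirm what was actually written.
    with open(path, "rb") as f:
        raw = f.read()

print(round_tripped == text)
```

With the encoding spelled out, the result should be the same under LANG=C as under any UTF-8 locale.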

I believe the main problem is that the "C" locale really means two very
different things:


a) Text is encoded as 7-bit ASCII; higher codepoints are an error
b) No encoding was specified

In both cases, treating "C" as "C.UTF-8" is not bad:
a) For 7-bit "text", there's no real difference between these locales
b) UTF-8 is a much better default than ASCII
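
Both points can be checked with a small Python sketch. The sample strings are assumptions chosen for illustration.

```python
# a) For 7-bit text, ASCII and UTF-8 produce and accept identical bytes.
ascii_text = "plain 7-bit text"
ascii_bytes = ascii_text.encode("ascii")
same_bytes = ascii_bytes == ascii_text.encode("utf-8")
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# b) For anything beyond 7 bits, a strict ASCII decoder fails,
#    while UTF-8 decodes the data as intended.
non_ascii = "café".encode("utf-8")
try:
    non_ascii.decode("ascii")
    ascii_rejected = False
except UnicodeDecodeError:
    ascii_rejected = True
utf8_decoded = non_ascii.decode("utf-8")

print(same_bytes, same_text, ascii_rejected, utf8_decoded)
```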




_______________________________________________
Python-ideas mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
