On Sun, Jan 24, 2021 at 01:32:28AM +0000, MRAB wrote:
> On 2021-01-24 01:14, Guido van Rossum wrote:
> >I have definitely seen BOMs written by Notepad on Windows 10.
> >
> >Why can’t the future be that open() in text mode guesses the encoding?
> >
> "In the face of ambiguity, refuse the temptation to guess."
"Although practicality beats purity."
The Zen is like scripture: there's a koan for any position you wish to
take :-)
If you want to be pedantic, and I certainly do *wink*, providing any
default for the encoding parameter is a guess. The encoding of all text
files is ambiguous (the intended encoding is metadata which is not
recorded in the file format). Most text files on Linux and Mac OS use
UTF-8, and many on Windows too, but not *all*, so setting the default to
UTF-8 is just a guess.
I understand that there are good heuristics for auto-detection of
encodings which are reliable and used in many other programs. If
auto-detection is a "guess", it's an *educated* guess and not much
different from the status quo, which usually guesses correctly on Linux
and Mac but often guesses wrongly on Windows. This proposal is to
improve the quality of the guess by inspecting the file's contents.
For example, a file opened in text mode where every second character is
a NUL is *almost certainly* UTF-16. The chance that somebody actually
intended to write:
H\0e\0l\0l\0o\0 \0W\0o\0r\0l\0d\0
rather than "Hello World" is negligible.
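To make the NUL heuristic concrete, here is a minimal sketch. The helper
name looks_like_utf16 and the 0.9 threshold are assumptions made up for
illustration, not anything in the stdlib or in this proposal, and the
sketch deliberately ignores BOMs and UTF-32:

```python
from typing import Optional


def looks_like_utf16(data: bytes, threshold: float = 0.9) -> Optional[str]:
    """Guess UTF-16 byte order from the NUL pattern (hypothetical helper).

    Returns "utf-16-le", "utf-16-be", or None when the pattern is absent.
    """
    # UTF-16 data has an even length; very short inputs tell us nothing.
    if len(data) < 4 or len(data) % 2:
        return None

    even = data[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = data[1::2]   # bytes at offsets 1, 3, 5, ...

    # Mostly-ASCII text encoded as UTF-16-LE puts the NUL high byte
    # second in each pair; UTF-16-BE puts it first.
    if odd.count(0) / len(odd) >= threshold and even.count(0) == 0:
        return "utf-16-le"
    if even.count(0) / len(even) >= threshold and odd.count(0) == 0:
        return "utf-16-be"
    return None
```

This only catches mostly-ASCII text. UTF-16 text in, say, Greek or
Chinese has few NUL bytes and would need a different test.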
Before we consider changing the default encoding to "auto-detect", I
would like to see some estimate of how many UTF-8 encoded files will be
misclassified as something else. That is, if we make this change, how
much software that currently guesses UTF-8 correctly (the default
encoding is the actual intended encoding) will break because it guesses
something else? That surely won't happen with mostly-ASCII files, but I
suppose it could happen with some non-English languages?
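This is not an answer to that question, but one way a detector could
limit that kind of breakage is to try a strict UTF-8 decode before any
other heuristic, since non-ASCII text in a legacy 8-bit encoding is
rarely valid UTF-8 by accident. A rough sketch, where the function name
guess_utf8_first and the cp1252 fallback are both assumptions chosen for
illustration:

```python
def guess_utf8_first(data: bytes, fallback: str = "cp1252") -> str:
    """Return "utf-8" if data is valid UTF-8, else the fallback name.

    Hypothetical helper: a real detector would run further heuristics
    instead of jumping straight to a single fallback encoding.
    """
    try:
        # Strict decoding rejects malformed multi-byte sequences.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

Pure-ASCII data is also valid UTF-8, so such files would be reported as
UTF-8, the same as the current default on Linux and Mac.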
--
Steve
_______________________________________________
Python-ideas mailing list -- [email protected]
Message archived at
https://mail.python.org/archives/list/[email protected]/message/U2T4JSKOUGSEXVVW3Y7LTXR7HQ5UJUKI/