On Mon, Jan 25, 2021 at 12:33 AM Steven D'Aprano wrote:
>
> On Sat, Jan 23, 2021 at 03:24:12PM +0000, Barry Scott wrote:
>
> > I think that you are going to create a bug magnet if you attempt to auto
> > detect the encoding.
> >
> > First problem I see is that the file may be a pipe and then you will block
> > until you have enough data to do the auto detect.
>
> Can you use `open('filename')` to read a pipe?

Yes. You can even use it with stdin:

>>> open("/proc/self/fd/0").read(1)
a
'a'

The second line ("a") is what I typed as input, even though I was
otherwise at the REPL.

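For a version that doesn't depend on /proc, here's a rough sketch of
the same idea using an anonymous pipe. Note the swap: it hands open() a
file descriptor from os.pipe() rather than a filename, and it closes
the write end first so that the read can't block.

```python
import os

# Create an anonymous pipe: r_fd is the read end, w_fd the write end.
r_fd, w_fd = os.pipe()

# Put some data in the pipe, then close the write end so the reader
# sees end-of-file instead of waiting for more input.
os.write(w_fd, b"abc")
os.close(w_fd)

# open() accepts an integer file descriptor as well as a path.
# The with-block closes r_fd on exit.
with open(r_fd, encoding="ascii") as reader:
    first = reader.read(1)

print(repr(first))
```

If the write end stayed open and nothing had been written yet, that
read would simply wait, which is the blocking concern raised above.
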
> Is blocking a problem in practice? If you try to open a network file,
> that could block too, if there are network issues. And since you're
> likely to follow the open with a read, the read is likely to block. So
> over all I don't think that blocking is an issue.

It definitely could be a problem if you read more than you need just
for the sake of autodetection, since on a pipe that extra read can
block. It needs to be possible to do everything with an absolute
minimum of reading.

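As one illustration of "minimum reading" (my assumption about a
possible approach, not anything proposed in this thread), a detector
could look only for a byte order mark, which needs at most four bytes.
The name sniff_bom is made up for this sketch.

```python
import codecs

# Longest marks first: the UTF-16 LE BOM (FF FE) is a prefix of the
# UTF-32 LE BOM (FF FE 00 00), so UTF-32 has to be checked earlier.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(prefix: bytes):
    """Return a codec name if prefix starts with a known BOM, else None.

    Hypothetical helper: only the first four bytes are ever relevant.
    """
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name
    return None
```

Anything without a BOM returns None here, so the caller would still
need a default encoding for that case.
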
> > Second problem is that the first N bytes are all in ASCII and only later
> > do you see Windows code page signature (odd lack of utf-8 signature).
>
> UTF-8 is a strict superset of ASCII, so if the file is actually
> ASCII, there is no harm in using UTF-8.
>
> The bigger issue is if you have N bytes of pure ASCII followed by some
> non-UTF superset, such as one of the ISO-8859-* encodings. So you end up
> detecting what you think is ASCII/UTF-8 but is actually some legacy
> encoding. But if N is large, say 512 bytes, that's unlikely in practice.

There's no problem if you think it's ASCII, so the only problem would
be if you start out thinking that it's UTF-8 and then discover that it
isn't. The scheme used by UTF-8 is designed such that random data, or
actual text in an eight-bit encoding, is highly unlikely to pass as
valid UTF-8, so a file that starts out as UTF-8 and then stops decoding
is more likely to be broken UTF-8 than legit ISO-8859-X.

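A quick sketch of that property, using Python's strict UTF-8 decoder as
the validity check. The helper name is_valid_utf8 and the sample
strings are just for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "caf\u00e9 na\u00efve"            # text with two non-ASCII letters

ascii_only = "plain text".encode("ascii")
latin1_text = sample.encode("iso-8859-1")  # one byte per accented letter
utf8_text = sample.encode("utf-8")         # two bytes per accented letter

# 512 bytes of pure ASCII followed by legacy-encoded text: the prefix
# passes as UTF-8, and the mismatch only shows up later in the stream.
late_legacy = b"x" * 512 + latin1_text

print(is_valid_utf8(ascii_only))        # ASCII is always valid UTF-8
print(is_valid_utf8(utf8_text))
print(is_valid_utf8(latin1_text))       # lone 0xE9 is not a valid sequence
print(is_valid_utf8(late_legacy[:512]))
print(is_valid_utf8(late_legacy))
```

In the ISO-8859-1 bytes, 0xE9 looks like the start of a three-byte
UTF-8 sequence but is followed by a space rather than a continuation
byte, so the decoder rejects it.
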
ChrisA