[issue9377] socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names

David Watson Tue, 19 Oct 2010 16:36:40 -0700

David Watson <[email protected]> added the comment:

> > In fact, I would think that non-ASCII bytes in a hostname most
> > probably indicated that a name resolution mechanism other than
> > the DNS was in use, and that the byte string should be passed
> > unaltered just as a typical C program would.
> 
> I'm not talking about byte strings, but character strings.


I mean that passing the str object returned by socket.gethostname()
to the Python lookup function ought to result in the C lookup
function receiving the same byte string that the C gethostname()
function returned (or else that the programmer must re-encode the
str to ensure that this result is obtained).
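
To illustrate the round trip I have in mind, here is a minimal
sketch. The byte string is a made-up example standing in for
whatever the C gethostname() might return; it is not taken from
any real system.

```python
# Hypothetical raw bytes, standing in for the result of the C gethostname().
raw = b"caf\xe9-host"

# ASCII/surrogateescape decoding maps each non-ASCII byte 0xNN to the
# lone surrogate U+DCNN, leaving ASCII characters unchanged.
name = raw.decode("ascii", "surrogateescape")

# Encoding with the same codec and error handler recovers the original
# bytes exactly, so the C lookup function would see what gethostname() gave.
roundtrip = name.encode("ascii", "surrogateescape")
```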

> > I don't object to that, but it does force a choice between
> > decoding an 8-bit name for display (e.g. by using the locale
> > encoding), and decoding it to round-trip automatically (e.g. by
> > using ASCII/surrogateescape, with support on the encoding side).
> 
> In the face of ambiguity, refuse the temptation to guess.

Yes, I would interpret that to mean not using the locale encoding
for data obtained from the network.  That's another reason why
the ASCII/surrogateescape scheme appeals to me more.

> Well, Python is not C. In Python, you would pass a str, and
> expect it to work, which means it will get automatically encoded
> with IDNA.

I think there might be a misunderstanding here - I've never
proposed changing the interpretation of Unicode characters in
hostname arguments.  The ASCII/surrogateescape scheme I suggested
only changes the interpretation of unpaired surrogate codes, as
they do not occur in IDNs or any other genuine Unicode data; all
IDNs, including those solely consisting of ASCII characters,
would be encoded to the same byte sequence as before.
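
A rough sketch of the encoding side, under my own assumptions:
encode_hostname is a hypothetical helper (not the patch itself),
and it uses the standard "idna" codec to stand in for the existing
behaviour. Only names containing the escape surrogates U+DC80 to
U+DCFF take the new path.

```python
def encode_hostname(name: str) -> bytes:
    """Hypothetical helper: encode a hostname argument to bytes."""
    # Lone surrogates in U+DC80..U+DCFF never occur in genuine Unicode
    # data, so they can only have come from surrogateescape decoding.
    if any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name):
        # Recover the original bytes; this raises UnicodeEncodeError if
        # the name also contains other non-ASCII characters.
        return name.encode("ascii", "surrogateescape")
    # Everything else, including pure-ASCII names, is encoded as before.
    return name.encode("idna")
```

With this, an IDN and an ASCII name encode exactly as they do
today, and only the escaped form is treated differently.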

ASCII/surrogateescape decoding could also be used without support
on the encoding side - that would satisfy the requirement to
"refuse the temptation to guess", would allow the original bytes
to be recovered, and would mean that attempting to look up a
non-ASCII result in str form would raise an exception rather than
looking up the wrong name.
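
A sketch of that decode-only variant, again with a made-up byte
string. It assumes the lookup functions keep encoding str
arguments with the "idna" codec, which as far as I can tell
rejects surrogate code points.

```python
# Hypothetical raw bytes from a non-DNS name source.
raw = b"caf\xe9-host"
name = raw.decode("ascii", "surrogateescape")

# With no support on the encoding side, encoding the str form fails
# instead of silently producing some other name.
try:
    name.encode("idna")
    lookup_rejected = False
except UnicodeError:
    lookup_rejected = True

# The programmer can still recover the original bytes explicitly.
recovered = name.encode("ascii", "surrogateescape")
```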

> Marc-Andre wants gethostname to use the Wide API on Windows, which,
> in theory, allows for cases where round-tripping to bytes is
> impossible.

Well, the name resolution APIs wrapped by Python are all
byte-oriented, so if the computer name were to have no bytes
equivalent then it wouldn't be possible to resolve it anyway, and
an exception rightly ought to be raised at some point in the
process of trying to do so.

----------
title: socket,  PEP 383: Mishandling of non-ASCII bytes in host/domain names -> 
socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue9377>
_______________________________________