[issue9377] socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names

David Watson Mon, 18 Oct 2010 11:14:10 -0700

David Watson <[email protected]> added the comment:

> The result from gethostname likely comes out of machine-local
> configuration. It may have non-ASCII in it, which is then likely
> encoded in the local encoding. When looking it up in DNS, IDNA
> should be applied.


I would have thought that someone who intended a Unicode hostname
to be looked up in its IDNA form would have encoded it using
IDNA, rather than an 8-bit encoding.  How many C programs would
transcode the name that way, rather than just passing the char *
from one interface to another?

In fact, I would think that non-ASCII bytes in a hostname most
probably indicate that a name resolution mechanism other than
the DNS is in use, and that the byte string should be passed
unaltered, just as a typical C program would pass it.

> OTOH, output from gethostbyaddr likely comes out of the DNS itself.
> Guessing what encoding it may have is futile - other than guessing
> that it really ought to be ASCII.

Sure, but that doesn't mean the result can't be made to
round-trip if it turns out not to be ASCII.  The guess that it
will be ASCII is, after all, still a guess (as is the guess that
it comes from the DNS).

> Python's socket module is clearly focused on the internet, and
> intends to support that well. So if you pass a non-ASCII
> string, it will have to encode that using IDNA. If that's
> not what you want to get, tough luck.

I don't object to that, but it does force a choice between
decoding an 8-bit name for display (e.g. by using the locale
encoding), and decoding it to round-trip automatically (e.g. by
using ASCII/surrogateescape, with support on the encoding side).
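
As a minimal sketch of the second option (decoding to round-trip),
using a made-up byte string rather than a real hostname, and only
the standard str/bytes codec machinery:

```python
# Hypothetical 8-bit name; 0xF4 is not valid ASCII.
raw = b"h\xf4st.example"

# Decoding with ASCII/surrogateescape never fails: each non-ASCII
# byte is mapped to a lone surrogate (0xF4 becomes U+DCF4).
decoded = raw.decode("ascii", "surrogateescape")

# Encoding with the same codec and error handler restores the
# original bytes exactly, whatever encoding they were really in.
recovered = decoded.encode("ascii", "surrogateescape")
```

The decoded string is not fit for display, but no information
about the original bytes is lost.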

Using PyUnicode_DecodeFSDefault() for the hostname or other
returned names (thus decoding them for display) would make this
issue solvable with programmer intervention - for instance,
"socket.gethostbyaddr(socket.gethostname())" could be replaced by
"socket.gethostbyaddr(os.fsencode(socket.gethostname()))", but
programmers might well neglect to do this, given that no encoding
was needed in Python 2.
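
The intervention could look roughly like the sketch below.  It
assumes a POSIX system, where os.fsdecode() and os.fsencode() use
the filesystem encoding with the surrogateescape handler; the
helper name and the sample bytes are invented for illustration,
and no real lookup is performed.

```python
import os

def fs_roundtrip(raw: bytes) -> bytes:
    """Decode as PyUnicode_DecodeFSDefault() would, then re-encode."""
    text = os.fsdecode(raw)    # display form; may contain surrogates
    return os.fsencode(text)   # the step the programmer must remember

# Hypothetical name containing a byte that is not valid ASCII/UTF-8.
raw_name = b"h\xf4st"
restored = fs_roundtrip(raw_name)

# In the pattern quoted above, the os.fsencode() call plays the role
# of the second step:
#   socket.gethostbyaddr(os.fsencode(socket.gethostname()))
```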

Also, even displaying a non-ASCII name decoded according to the
locale creates potential for confusion: when the user types the
same characters into a Python program for lookup, they will
(again, barring programmer intervention) be encoded to their IDNA
ASCII-compatible equivalent, and so will not represent the same
byte sequence as the characters the user sees on the screen.
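
A small sketch of that mismatch, assuming purely for illustration
that the locale encoding is Latin-1 and using a made-up name:

```python
# What the user sees on screen and types back in (o with circumflex).
typed = "h\u00f4st.example"

# Bytes the name would have had if it came from a Latin-1 source.
locale_bytes = typed.encode("latin-1")

# Bytes produced when the same text is encoded with Python's
# built-in "idna" codec: an ASCII-compatible "xn--" label.
idna_bytes = typed.encode("idna")

# The two byte sequences differ, so the lookup is for another name.
same_bytes = (locale_bytes == idna_bytes)
```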

So overall, I do think it is better to decode names for automatic
round-tripping rather than for display, but my main concern is
simply that it should be possible to recover the original bytes
so that round-tripping is at least possible.
PyUnicode_DecodeFSDefault() would accomplish that much.

----------
title: socket,  PEP 383: Mishandling of non-ASCII bytes in host/domain names -> 
socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue9377>
_______________________________________