David Watson <bai...@users.sourceforge.net> added the comment:

> The result from gethostname likely comes out of machine-local
> configuration. It may have non-ASCII in it, which is then likely
> encoded in the local encoding. When looking it up in DNS, IDNA
> should be applied.

I would have thought that someone who intended a Unicode hostname
to be looked up in its IDNA form would have encoded it using IDNA,
rather than an 8-bit encoding - how many C programs would transcode
the name that way, rather than just passing the char * from one
interface to another?

In fact, I would think that non-ASCII bytes in a hostname most
probably indicated that a name resolution mechanism other than the
DNS was in use, and that the byte string should be passed unaltered
just as a typical C program would.

> OTOH, output from gethostbyaddr likely comes out of the DNS itself.
> Guessing what encoding it may have is futile - other than guessing
> that it really ought to be ASCII.

Sure, but that doesn't mean the result can't be made to round-trip
if it turns out not to be ASCII. The guess that it will be ASCII
is, after all, still a guess (as is the guess that it comes from
the DNS).

> Python's socket module is clearly focused on the internet, and
> intends to support that well. So if you pass a non-ASCII
> string, it will have to encode that using IDNA. If that's
> not what you want to get, tough luck.

I don't object to that, but it does force a choice between decoding
an 8-bit name for display (e.g. by using the locale encoding), and
decoding it to round-trip automatically (e.g. by using
ASCII/surrogateescape, with support on the encoding side).

Using PyUnicode_DecodeFSDefault() for the hostname or other
returned names (thus decoding them for display) would make this
issue solvable with programmer intervention - for instance,
"socket.gethostbyaddr(socket.gethostname())" could be replaced by
"socket.gethostbyaddr(os.fsencode(socket.gethostname()))", but
programmers might well neglect to do this, given that no encoding
was needed in Python 2.
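
To make that choice concrete, here is a minimal sketch in pure Python.
It assumes a hypothetical host name whose raw bytes are "bücher" in
Latin-1. The helper names are illustrative only and are not part of
the socket module, and Latin-1 stands in for whatever the locale
encoding happens to be.

```python
# Hypothetical raw host name as a C API might return it:
# "bücher" encoded in Latin-1 (an assumption for illustration).
RAW_NAME = b"b\xfccher"


def decode_for_roundtrip(raw: bytes) -> str:
    """Decode so the original bytes can be recovered (PEP 383 style)."""
    # Each non-ASCII byte becomes a lone surrogate, e.g. 0xFC -> U+DCFC.
    return raw.decode("ascii", "surrogateescape")


def encode_roundtrip(name: str) -> bytes:
    """Inverse of decode_for_roundtrip(): surrogates turn back into bytes."""
    return name.encode("ascii", "surrogateescape")


def decode_for_display(raw: bytes, encoding: str = "latin-1") -> str:
    """Decode for display, using an assumed locale encoding."""
    return raw.decode(encoding)


roundtrip_name = decode_for_roundtrip(RAW_NAME)
display_name = decode_for_display(RAW_NAME)

# The round-trip form gives back exactly the bytes we started with.
recovered = encode_roundtrip(roundtrip_name)

# The display form is readable, but encoding it with IDNA (as the
# socket module does for non-ASCII str arguments) yields other bytes.
idna_bytes = display_name.encode("idna")

print(ascii(roundtrip_name), recovered == RAW_NAME)
print(ascii(display_name), idna_bytes, idna_bytes == RAW_NAME)
```

The round-trip string is not fit for display, and the display string
no longer maps back to the original bytes once IDNA is applied, which
is the trade-off described above.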
Also, even displaying a non-ASCII name decoded according to the
locale creates potential for confusion, as when the user types the
same characters into a Python program for lookup (again, barring
programmer intervention), they will not represent the same byte
sequence as the characters the user sees on the screen (as they
will instead represent their IDNA ASCII-compatible equivalent).

So overall, I do think it is better to decode names for automatic
round-tripping rather than for display, but my main concern is
simply that it should be possible to recover the original bytes so
that round-tripping is at least possible.
PyUnicode_DecodeFSDefault() would accomplish that much at least.

----------
title: socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names -> socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9377>
_______________________________________