David Watson <bai...@users.sourceforge.net> added the comment:
> Is this patch in response to an actual problem, or a theoretical problem?
> If "actual problem": what was the specific application, and what was the
> specific host name?
It's about environments, not applications - the local network may
be configured with non-ASCII bytes in hostnames (either in the
local DNS *or* a different lookup mechanism - I mentioned
/etc/hosts as a simple example), or someone might deliberately
connect from a garbage hostname as a denial of service attack
against a server which tries to look it up with gethostbyaddr()
or whatever (this may require a "non-strict" resolver library, as
noted above).
> If theoretical, I recommend to close it as "won't fix". I find it perfectly
> reasonable if Python's socket module gives an error if the hostname can't be
> clearly decoded. Applications that run into it as a result of gethostbyaddr
> should treat that as "no reverse name available".
There are two points here. One is that the decoding can fail; I
do think that programmers will find this surprising, and the fact
that Python refuses to return what was actually received is a
regression compared to 2.x.
The other is that the encoding and decoding are not symmetric -
hostnames are being decoded with UTF-8 but encoded with IDNA.
That means that when a decoded hostname contains a non-ASCII
character which is not prohibited by IDNA/Nameprep, that string
will, when used in a subsequent call, not refer to the hostname
that was actually received, because it will be re-encoded using a
different codec.
Attaching a refreshed version of try-surrogateescape-first.diff.
I've separated out the change to getnameinfo() as it may be
superfluous (issue #1027206).
----------
Added file: http://bugs.python.org/file18616/try-surrogateescape-first-4.diff
Added file:
http://bugs.python.org/file18617/try-surrogateescape-first-getnameinfo-4.diff
_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9377>
_______________________________________
Accept ASCII/surrogateescape strings as hostname arguments.
diff --git a/Doc/library/socket.rst b/Doc/library/socket.rst
--- a/Doc/library/socket.rst
+++ b/Doc/library/socket.rst
@@ -49,6 +49,28 @@ supported. The address format required b
automatically selected based on the address family specified when the socket
object was created.
+When a hostname is returned by a system interface, it is decoded into
+a string using the ``'ascii'`` codec and the ``'surrogateescape'``
+error handler; this leaves ASCII bytes as ASCII, including IDNA
+ASCII-compatible encodings (see :mod:`encodings.idna`), but converts
+any non-ASCII bytes to the Unicode lone surrogate codes
+U+DC80...U+DCFF.
+
+Hostname arguments can be passed as strings or :class:`bytes` objects.
+The latter are passed to the system unchanged, while strings are
+encoded as follows: if a string contains only ASCII characters and/or
+the Unicode lone surrogate codes U+DC80...U+DCFF, it is encoded using
+the ``'ascii'`` codec and the ``'surrogateescape'`` error handler;
+otherwise it is converted to IDNA ASCII-compatible form using the
+``'idna'`` codec, and if this is not possible, :exc:`UnicodeError` is
+raised.
+
+.. versionchanged:: XXX
+ Previously, hostnames were decoded as UTF-8 and encoded using IDNA
+ or UTF-8; ``surrogateescape`` was not used; some interfaces
+ formerly accepted :class:`bytearray` objects, or did not accept
+ :class:`bytes` objects.
+
For IPv4 addresses, two special forms are accepted instead of a host address:
the empty string represents :const:`INADDR_ANY`, and the string
``'<broadcast>'`` represents :const:`INADDR_BROADCAST`. The behavior is not
diff --git a/Lib/test/test_socket.py b/Lib/test/test_socket.py
--- a/Lib/test/test_socket.py
+++ b/Lib/test/test_socket.py
@@ -322,6 +322,51 @@ class GeneralModuleTests(unittest.TestCa
except socket.error:
pass
+ def tryHostnameArgs(self, function, notfounderror):
+ # Call the given one-argument function with various valid and
+ # invalid representations of nonexistent hostnames. Check
+ # that it raises notfounderror for valid representations, and
+ # UnicodeError for invalid ones.
+
+ # An RFC 1123-compliant host name (".invalid" TLD is reserved
+ # under RFC 2606).
+ self.assertRaises(notfounderror, function, "host.domain.invalid")
+ # Previous name as a bytes object.
+ self.assertRaises(notfounderror, function, b"host.domain.invalid")
+ # A domain name with a non-ASCII octet, as bytes.
+ self.assertRaises(notfounderror, function, b"\xff.domain.invalid")
+ # Previous domain name as ASCII/surrogateescape string representation.
+ self.assertRaises(notfounderror, function, "\udcff.domain.invalid")
+ # A legal IDN.
+ self.assertRaises(notfounderror, function, "â¬.domain.invalid")
+ # A combination of the previous two, which may make sense in
+ # theory, but is not accepted (the Euro sign means it must be
+ # interpreted as an IDN, but it is is not a legal IDN, because
+ # it contains a surrogate character).
+ self.assertRaises(UnicodeError, function, "\udcff.â¬.domain.invalid")
+
+ def testGethostbynameHostnames(self):
+ self.tryHostnameArgs(socket.gethostbyname,
+ (socket.herror, socket.gaierror))
+
+ def testGethostbyname_exHostnames(self):
+ self.tryHostnameArgs(socket.gethostbyname_ex,
+ (socket.herror, socket.gaierror))
+
+ def testGethostbyaddrHostnames(self):
+ self.tryHostnameArgs(socket.gethostbyaddr,
+ (socket.herror, socket.gaierror))
+
+ def testGetaddrinfoHostnames(self):
+ self.tryHostnameArgs(lambda host: socket.getaddrinfo(host, None),
+ socket.gaierror)
+
+ def testSocketObjectHostnames(self):
+ def f(host):
+ s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+ s.connect((host, 80))
+ self.tryHostnameArgs(f, socket.error)
+
def testNtoH(self):
# This just checks that htons etc. are their own inverse,
# when looking at the lower 16 or 32 bits.
diff --git a/Modules/socketmodule.c b/Modules/socketmodule.c
--- a/Modules/socketmodule.c
+++ b/Modules/socketmodule.c
@@ -739,6 +739,57 @@ PyThread_type_lock netdb_lock;
#endif
+/* "O&" converter for hostname arguments, which attempts to encode
+ strings as ASCII with "surrogateescape" error handler before
+ falling back to strict IDNA encoding. Also accepts bytes objects,
+ but not bytearray. On success, sets PyObject *addr to a new
+ reference to a bytes object, and may clear the error indicator. */
+
+static int
+hostname_converter(PyObject *arg, void *addr)
+{
+ PyObject **result = addr;
+
+ if (arg == NULL) {
+ Py_DECREF(result);
+ return 1;
+ }
+
+ if (PyUnicode_Check(arg)) {
+ PyObject *tmp;
+
+ tmp = PyUnicode_EncodeASCII(PyUnicode_AS_UNICODE(arg),
+ PyUnicode_GET_SIZE(arg),
+ "surrogateescape");
+ if (tmp == NULL) {
+ if (!PyErr_ExceptionMatches(PyExc_UnicodeEncodeError))
+ return 0;
+ PyErr_Clear();
+ tmp = PyUnicode_AsEncodedString(arg, "idna", "strict");
+ if (tmp == NULL)
+ return 0;
+ }
+ arg = tmp;
+ }
+ else
+ Py_INCREF(arg);
+ if (!PyBytes_Check(arg)) {
+ PyErr_SetString(PyExc_TypeError,
+ "hostnames must be str or bytes objects");
+ goto err;
+ }
+ if (PyBytes_GET_SIZE(arg) != strlen(PyBytes_AS_STRING(arg))) {
+ PyErr_SetString(PyExc_TypeError, "embedded NUL character in hostname");
+ goto err;
+ }
+ *result = arg;
+ return Py_CLEANUP_SUPPORTED;
+err:
+ Py_DECREF(arg);
+ return 0;
+}
+
+
/* Return the string representation for the given hostname. */
static PyObject *
@@ -1219,6 +1270,7 @@ getsockaddrarg(PySocketSockObject *s, Py
case AF_INET:
{
struct sockaddr_in* addr;
+ PyObject *hostobj;
char *host;
int port, result;
if (!PyTuple_Check(args)) {
@@ -1229,13 +1281,14 @@ getsockaddrarg(PySocketSockObject *s, Py
Py_TYPE(args)->tp_name);
return 0;
}
- if (!PyArg_ParseTuple(args, "eti:getsockaddrarg",
- "idna", &host, &port))
+ if (!PyArg_ParseTuple(args, "O&i:getsockaddrarg",
+ hostname_converter, &hostobj, &port))
return 0;
+ host = PyBytes_AS_STRING(hostobj);
addr=(struct sockaddr_in*)addr_ret;
result = setipaddr(host, (struct sockaddr *)addr,
sizeof(*addr), AF_INET);
- PyMem_Free(host);
+ Py_DECREF(hostobj);
if (result < 0)
return 0;
if (port < 0 || port > 0xffff) {
@@ -1254,6 +1307,7 @@ getsockaddrarg(PySocketSockObject *s, Py
case AF_INET6:
{
struct sockaddr_in6* addr;
+ PyObject *hostobj;
char *host;
int port, flowinfo, scope_id, result;
flowinfo = scope_id = 0;
@@ -1265,15 +1319,16 @@ getsockaddrarg(PySocketSockObject *s, Py
Py_TYPE(args)->tp_name);
return 0;
}
- if (!PyArg_ParseTuple(args, "eti|ii",
- "idna", &host, &port, &flowinfo,
+ if (!PyArg_ParseTuple(args, "O&i|ii",
+ hostname_converter, &hostobj, &port, &flowinfo,
&scope_id)) {
return 0;
}
+ host = PyBytes_AS_STRING(hostobj);
addr = (struct sockaddr_in6*)addr_ret;
result = setipaddr(host, (struct sockaddr *)addr,
sizeof(*addr), AF_INET6);
- PyMem_Free(host);
+ Py_DECREF(hostobj);
if (result < 0)
return 0;
if (port < 0 || port > 0xffff) {
@@ -3009,15 +3064,17 @@ socket_gethostbyname(PyObject *self, PyO
{
char *name;
sock_addr_t addrbuf;
- PyObject *ret = NULL;
-
- if (!PyArg_ParseTuple(args, "et:gethostbyname", "idna", &name))
+ PyObject *nameobj, *ret = NULL;
+
+ if (!PyArg_ParseTuple(args, "O&:gethostbyname",
+ hostname_converter, &nameobj))
return NULL;
+ name = PyBytes_AS_STRING(nameobj);
if (setipaddr(name, SAS2SA(&addrbuf), sizeof(addrbuf), AF_INET) < 0)
goto finally;
ret = makeipaddr(SAS2SA(&addrbuf), sizeof(struct sockaddr_in));
finally:
- PyMem_Free(name);
+ Py_DECREF(nameobj);
return ret;
}
@@ -3162,6 +3219,7 @@ gethost_common(struct hostent *h, struct
static PyObject *
socket_gethostbyname_ex(PyObject *self, PyObject *args)
{
+ PyObject *nameobj;
char *name;
struct hostent *h;
#ifdef ENABLE_IPV6
@@ -3185,8 +3243,10 @@ socket_gethostbyname_ex(PyObject *self,
#endif
#endif /* HAVE_GETHOSTBYNAME_R */
- if (!PyArg_ParseTuple(args, "et:gethostbyname_ex", "idna", &name))
+ if (!PyArg_ParseTuple(args, "O&:gethostbyname_ex",
+ hostname_converter, &nameobj))
return NULL;
+ name = PyBytes_AS_STRING(nameobj);
if (setipaddr(name, (struct sockaddr *)&addr, sizeof(addr), AF_INET) < 0)
goto finally;
Py_BEGIN_ALLOW_THREADS
@@ -3219,7 +3279,7 @@ socket_gethostbyname_ex(PyObject *self,
PyThread_release_lock(netdb_lock);
#endif
finally:
- PyMem_Free(name);
+ Py_DECREF(nameobj);
return ret;
}
@@ -3242,6 +3302,7 @@ socket_gethostbyaddr(PyObject *self, PyO
struct sockaddr_in addr;
#endif
struct sockaddr *sa = (struct sockaddr *)&addr;
+ PyObject *addrobj;
char *ip_num;
struct hostent *h;
PyObject *ret = NULL;
@@ -3266,8 +3327,10 @@ socket_gethostbyaddr(PyObject *self, PyO
int al;
int af;
- if (!PyArg_ParseTuple(args, "et:gethostbyaddr", "idna", &ip_num))
+ if (!PyArg_ParseTuple(args, "O&:gethostbyaddr",
+ hostname_converter, &addrobj))
return NULL;
+ ip_num = PyBytes_AS_STRING(addrobj);
af = AF_UNSPEC;
if (setipaddr(ip_num, sa, sizeof(addr), af) < 0)
goto finally;
@@ -3315,7 +3378,7 @@ socket_gethostbyaddr(PyObject *self, PyO
PyThread_release_lock(netdb_lock);
#endif
finally:
- PyMem_Free(ip_num);
+ Py_DECREF(addrobj);
return ret;
}
@@ -3867,17 +3930,9 @@ socket_getaddrinfo(PyObject *self, PyObj
}
if (hobj == Py_None) {
hptr = NULL;
- } else if (PyUnicode_Check(hobj)) {
- idna = PyObject_CallMethod(hobj, "encode", "s", "idna");
- if (!idna)
- return NULL;
- assert(PyBytes_Check(idna));
+ } else if (hostname_converter(hobj, &idna)) {
hptr = PyBytes_AS_STRING(idna);
- } else if (PyBytes_Check(hobj)) {
- hptr = PyBytes_AsString(hobj);
} else {
- PyErr_SetString(PyExc_TypeError,
- "getaddrinfo() argument 1 must be string or None");
return NULL;
}
if (PyLong_CheckExact(pobj)) {
Accept ASCII/surrogateescape strings in getnameinfo().
diff --git a/Lib/test/test_socket.py b/Lib/test/test_socket.py
--- a/Lib/test/test_socket.py
+++ b/Lib/test/test_socket.py
@@ -361,6 +361,10 @@ class GeneralModuleTests(unittest.TestCa
self.tryHostnameArgs(lambda host: socket.getaddrinfo(host, None),
socket.gaierror)
+ def testGetnameinfoHostnames(self):
+ self.tryHostnameArgs(lambda host: socket.getnameinfo((host, 0), 0),
+ socket.gaierror)
+
def testSocketObjectHostnames(self):
def f(host):
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
diff --git a/Modules/socketmodule.c b/Modules/socketmodule.c
--- a/Modules/socketmodule.c
+++ b/Modules/socketmodule.c
@@ -4012,6 +4012,7 @@ socket_getnameinfo(PyObject *self, PyObj
{
PyObject *sa = (PyObject *)NULL;
int flags;
+ PyObject *hostobj;
char *hostp;
int port, flowinfo, scope_id;
char hbuf[NI_MAXHOST], pbuf[NI_MAXSERV];
@@ -4027,9 +4028,10 @@ socket_getnameinfo(PyObject *self, PyObj
"getnameinfo() argument 1 must be a tuple");
return NULL;
}
- if (!PyArg_ParseTuple(sa, "si|ii",
- &hostp, &port, &flowinfo, &scope_id))
+ if (!PyArg_ParseTuple(sa, "O&i|ii", hostname_converter,
+ &hostobj, &port, &flowinfo, &scope_id))
return NULL;
+ hostp = PyBytes_AS_STRING(hostobj);
PyOS_snprintf(pbuf, sizeof(pbuf), "%d", port);
memset(&hints, 0, sizeof(hints));
hints.ai_family = AF_UNSPEC;
@@ -4080,6 +4082,7 @@ socket_getnameinfo(PyObject *self, PyObj
fail:
if (res)
freeaddrinfo(res);
+ Py_DECREF(hostobj);
return ret;
}
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com