On Mon, Jan 9, 2017 at 12:01 PM, Erik Bray <erik.m.b...@gmail.com> wrote:
> On Fri, Jan 6, 2017 at 12:40 PM, Erik Bray <erik.m.b...@gmail.com> wrote:
>> Hello, and happy new-ish year,
>>
>> I've been working on and off over the past few months on bringing
>> Python's compatibility with Cygwin up to snuff, including having all
>> pertinent tests passing.  I've noticed that there are several tests
>> (which I currently skip) that cause the process to hang indefinitely,
>> and not respond to any signals from Cygwin (it can only be killed from
>> Windows).  This is Cygwin 64-bit; I have not tested 32-bit.
>>
>> I finally looked into this problem and found the lockup to be in
>> pselect() somewhere.  Attached I've provided the most minimal example
>> I've been able to come up with so far that reproduces the problem,
>> which I'll describe in a bit more detail next.  I would attach a
>> cygcheck output if requested, but I was also able to reproduce this on
>> a recent build from source.
>>
>> So far as I've been able to tell, the problem only occurs with AF_UNIX
>> sockets.  In the example I have a 'server' socket and a 'client'
>> socket both set to non-blocking.  The client connects to the socket,
>> returning errno EINPROGRESS as expected.  Then I do a pselect on the
>> client socket to wait until it is ready to be read from.  The hang
>> only happens when I pselect on the client socket, and not on the
>> server socket.  It doesn't seem to make a difference what the timeout
>> is.  One thing I have not tried is whether it also happens when the
>> client and server are actually different processes, but the example
>> from the Python tests this is reproducing is where they are both in
>> the same process.
>>
>> Below is (I think) the most relevant output from strace on the test
>> case.  It seems to hang somewhere in socket_cleanup, but I haven't
>> investigated any further than that.
>
> I made a little bit of progress debugging this, but now I'm stumped.
> It seems the problem is this:
>
> For each socket whose fd is passed to select() a thread_socket is
> started which calls peek_socket until there are bits ready on the
> socket, or until the timeout is reached.  This in turn calls
> fhandler_socket::evaluate_events.
>
> The reason it's only locking up on my "client thread" on which
> connect() is called, is that evaluate_events notes that the socket is
> waiting to connect, and this passes control to
> fhandler_socket::af_local_connect().  af_local_connect() temporarily
> sets the socket to blocking, then sends a magic string to the socket
> (you can see in my strace log that this succeeds).  What's strange,
> and what I don't understand, is that there are no FD_READ or FD_OOB
> events recorded for the WSASendTo call from af_local_send_secret().
> Then, after af_local_send_secret() it calls af_local_recv_secret().
> This calls recv_internal() which in turn calls recursively into
> fhandler_socket::evaluate_events where it waits for an FD_READ or
> FD_OOB event that never arrives.  And since it set the socket to
> blocking it just sits in an infinite loop.
>
> Meanwhile the timer for the select() call expires and tries to shut
> down the thread_socket but it can't because it never completes.
>
> What I don't understand is why there is not an event recorded for the
> WSASendTo in send_internal.
> I even wrapped it with the following debug code to wait for an
> FD_READ event immediately following the WSASendTo:
>
>     else if (get_socket_type () == SOCK_STREAM)
>       {
>         WSAEventSelect(get_socket (), wsock_evt, EVENT_MASK);
>         res = WSASendTo (get_socket (), out_buf, out_idx, &ret, flags,
>                          wsamsg->name, wsamsg->namelen, NULL, NULL);
>         debug_printf("WSASendTo sent %d bytes; ret: %d", ret, res);
>         while (!(res=wait_for_events (FD_READ | FD_OOB, 0))) {
>             debug_printf("Waiting for socket to be readable");
>         }
>       }
>
> But the strace at this point just outputs:
>
>    62  108286 [socksel] poll_test 24152
> fhandler_socket::af_local_connect: af_local_connect called,
> no_getpeereid=0
>   156  108442 [socksel] poll_test 24152
> fhandler_socket::send_internal: WSASendTo sent 16 bytes; ret: 0
>
> It never returns from send_internal.  I don't have deep knowledge of
> WinSock, but from what I've read ISTM WSASendTo should have triggered
> an FD_READ event on the socket, and it doesn't for some reason.
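
For readers without the attachment, the scenario described in the quoted message can be sketched roughly as follows. This is not the original attachment: the names `set_nonblocking` and `pselect_on_client`, the socket path argument, and the millisecond timeout are assumptions made purely for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to non-blocking mode; 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Hypothetical reproduction (not the original attachment).
 * Returns the result of pselect() on the client socket:
 * 0 = timed out, >0 = readable, -1 = pselect error; -2 = setup failed.
 */
int pselect_on_client(const char *path, long timeout_ms)
{
    struct sockaddr_un addr;
    struct timespec timeout;
    fd_set rfds;
    int server = -1, client = -1, ret = -2;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);                   /* drop a stale socket file, if any */

    server = socket(AF_UNIX, SOCK_STREAM, 0);
    client = socket(AF_UNIX, SOCK_STREAM, 0);
    if (server < 0 || client < 0)
        goto done;
    if (bind(server, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(server, 5) != 0)
        goto done;
    if (set_nonblocking(server) != 0 || set_nonblocking(client) != 0)
        goto done;

    /* The post reports EINPROGRESS on Cygwin; other systems may return 0. */
    if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) != 0 &&
        errno != EINPROGRESS)
        goto done;

    /* Wait for the client socket to become readable, as in the post. */
    FD_ZERO(&rfds);
    FD_SET(client, &rfds);
    timeout.tv_sec = timeout_ms / 1000;
    timeout.tv_nsec = (timeout_ms % 1000) * 1000000L;
    ret = pselect(client + 1, &rfds, NULL, NULL, &timeout, NULL);

done:
    if (client >= 0)
        close(client);
    if (server >= 0)
        close(server);
    unlink(path);
    return ret;
}
```

Nothing ever writes to the client socket here, so on a system without the problem a call such as `pselect_on_client("test.sock", 1000)` would be expected to time out. On the affected Cygwin build, the description above suggests the call would not return at all.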

After playing around with this a bit more I came up with a much
simpler example.  It has nothing to do with select() at all, at least
not directly.  The simplified example is just:

#include <arpa/inet.h>
#include <sys/socket.h>
#include <string.h>
#include <stdio.h>
#include <sys/un.h>
#include <errno.h>

int main(void) {
    fd_set rfds;
    int sock_server, sock_client;
    int retval;
    struct sockaddr_un addr;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, "@test.sock");

    sock_server = socket(AF_UNIX, SOCK_STREAM, 0);
    if (bind(sock_server, (struct sockaddr*)&addr, sizeof(addr))) {
        printf("binding server socket failed\n");
        return 1;
    }

    retval = listen(sock_server, 5);
    printf("Ret from listen: %d\n", retval);

    sock_client = socket(AF_UNIX, SOCK_STREAM, 0);
    retval = connect(sock_client, (struct sockaddr*)&addr, sizeof(addr));
    printf("Ret from client connect: %d; errno: %d\n", retval, errno);
    return 0;
}

On Linux this example works as I expect, and the connect() call
returns immediately.  On Cygwin, however, the connect() call hangs
after af_local_send_secret(), as described in my first message.

When I split this example up into separate client and server
processes, though, it works as expected: the connect() is properly
negotiated and returns immediately.
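
A two-process variant of the simplified example might look roughly like the sketch below, which uses fork() to put the client in a child process. The function name `connect_from_child`, the path parameter, the 5-second select() guard, and in particular the accept() call in the parent are assumptions; the single-process example never calls accept(), and the post does not show what the separate server process did.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original post): server in the parent,
 * client in a forked child.  Returns 0 if the child's connect() succeeded,
 * 1 if it failed, and -1 on a setup or wait error.
 */
int connect_from_child(const char *path)
{
    struct sockaddr_un addr;
    struct timeval tv = { 5, 0 };
    fd_set rfds;
    int server, conn, status = 0, result = -1;
    pid_t pid;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);                   /* drop a stale socket file, if any */

    server = socket(AF_UNIX, SOCK_STREAM, 0);
    if (server < 0)
        return -1;
    if (bind(server, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(server, 5) != 0 ||
        (pid = fork()) < 0) {
        close(server);
        unlink(path);
        return -1;
    }

    if (pid == 0) {
        /* Child: the client side; report the connect() result via exit code. */
        int client = socket(AF_UNIX, SOCK_STREAM, 0);
        int ok = client >= 0 &&
                 connect(client, (struct sockaddr *)&addr, sizeof(addr)) == 0;
        _exit(ok ? 0 : 1);
    }

    /* Parent: wait up to 5 s for a pending connection, then accept it
     * (assumption: the separate server process in the post did the same). */
    FD_ZERO(&rfds);
    FD_SET(server, &rfds);
    if (select(server + 1, &rfds, NULL, NULL, &tv) > 0) {
        conn = accept(server, NULL, NULL);
        if (conn >= 0)
            close(conn);
    }

    /* Collect the child's exit status; blocks if the child itself hangs. */
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    close(server);
    unlink(path);
    return result;
}
```

Calling `connect_from_child("test.sock")` from a small main() and printing the return value gives a quick check of whether the cross-process connect() completes on a given system.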