I have a system scenario where thousands of applications are running, and via a service discovery mechanism they are all notified that a service they are all interested in has come online. They all attempt to open a TCP connection to the service, and this happens virtually instantly.
The problem I see is that many of the applications trying to connect end up in a state where they consume a lot of CPU. I am using Python 3.4.2 and asyncio, and have set the server backlog to 4000 in an effort to accommodate the connection-request backlog. I am actually using an event loop from aiozmq (though no ZMQ sockets are involved in this scenario), but under the covers this just uses epoll, so it should really be the same as using the DefaultSelector.

Using strace on the apps exhibiting the issue, I see that a socket is continuously triggering a POLLERR|POLLHUP event, and this is the cause of the high CPU usage. The socket in question is the one that was attempting to connect to the newly started service. I am guessing the POLLHUP is caused by the server having trouble processing the volume of connect requests.

I think I need to drop/close the socket causing the POLLHUP. However, from looking through the asyncio source code I don't see how I can do that from within the _selector.select() or _process_events() functions with only the knowledge of which fd is causing the issue. How do poll errors propagate up from the select loop? I could potentially unregister the fd, but as far as I can tell that won't cause the transport/protocol to be closed, which prevents my normal error handling from attempting to reconnect to the service. The asyncio selector classes seem to ignore events other than EVENT_READ and EVENT_WRITE.

Any help would be appreciated.

Regards,
Chris
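To illustrate that last point, here is a minimal experiment (not my application code; the socketpair is just a stand-in for the failing connection, so I can provoke a hangup locally). It shows that the selectors module folds POLLERR/POLLHUP into the ordinary read/write readiness mask, so the loop keeps waking up without the caller ever being told the fd is dead:

```python
import selectors
import socket

# A connected pair; closing one end provokes a hangup (POLLHUP) on the
# surviving end, much like the server dropping our connection attempt.
a, b = socket.socketpair()
b.close()

sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ | selectors.EVENT_WRITE)

# The hangup is reported only as read/write readiness; the POLLHUP and
# POLLERR bits themselves never reach the caller, so a callback that
# neither reads the EOF nor closes the fd will spin on this forever.
ready = sel.select(timeout=1.0)
for key, events in ready:
    print("fd", key.fd,
          "readable:", bool(events & selectors.EVENT_READ),
          "writable:", bool(events & selectors.EVENT_WRITE))

sel.unregister(a)
a.close()
```

In my case the equivalent wakeup happens inside asyncio's selector event loop rather than in my own code, which is why I can't see where to hook in and close the offending transport.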
-- https://mail.python.org/mailman/listinfo/python-list