Hi Team,

My team is facing a non-responding issue under the following circumstances:
1. Environment: Apache Tomcat 8.5.15, JDK 1.8.0_121.

2. Tomcat configuration: APR enabled with protocol="org.apache.coyote.http11.Http11AprProtocol", and the Executor's maxThreads set to maxThreads="1200".

3. The web server was under massive load. All requests were HTTP/1.1 requests carrying a "Connection: close" HTTP header. At this point the web server showed some latency in its responses, but it was still running.

4. Some keep-alive requests then came in. Those requests carried a "Connection: keep-alive" HTTP header.

5. The non-keep-alive requests that followed got no response for a long time.

We took a thread dump with jstack at that point and saw that the Acceptor thread is WAITING:

"http-apr-8080-Acceptor@5830" daemon prio=5 tid=0x27 nid=NA waiting
  java.lang.Thread.State: WAITING
      at sun.misc.Unsafe.park(Unsafe.java:-1)
      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
      at org.apache.tomcat.util.threads.LimitLatch.countUpOrAwait(LimitLatch.java:117)
      at org.apache.tomcat.util.net.AbstractEndpoint.countUpOrAwaitConnection(AbstractEndpoint.java:1292)
      at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:83)
      at java.lang.Thread.run(Thread.java:748)

But the Poller thread is RUNNABLE:

"http-apr-8080-Poller@5809" daemon prio=5 tid=0x25 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
      at org.apache.tomcat.jni.Poll.poll(Poll.java:-1)
      at org.apache.tomcat.util.net.AprEndpoint$Poller.run(AprEndpoint.java:1488)
      at java.lang.Thread.run(Thread.java:748)

And all of the Executor threads are WAITING like this:

"http-apr-8080-exec-21@6078" daemon prio=5 tid=0x3f nid=NA waiting
  java.lang.Thread.State: WAITING
      at sun.misc.Unsafe.park(Unsafe.java:-1)
      at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
      at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
      at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:89)
      at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:33)
      at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
      at java.lang.Thread.run(Thread.java:748)

We dug into the source code and found some clues:

1. The Acceptor thread is parked by the LimitLatch. This means the LimitLatch is never released by anyone else, which in turn means the previous connections never get their close operation performed in our scenario. (AprEndpoint.java#L955)

2. The close logic takes place on the Poller thread. (AprEndpoint.java#L1624)

3. If the polling logic takes a long time, the Poller thread is effectively blocked (it shows as RUNNABLE, but it can still be blocked inside the native method), and the destroySocket call is held up. (AprEndpoint.java#L1680)

4. The Acceptor processes each new connection directly rather than registering it with the poller. (AprEndpoint.java#L2268)

So in this situation pollerSpace[i] always equals actualPollerSize (AprEndpoint.java#L1679), and "nextPollerTime" keeps growing to a very large value. When some keep-alive requests then arrive, the Handler implementation processes those connections and registers each of them back with the poller (AbstractProtocol.java#L933). Now pollerSpace changes, and the Poller polls the socket with that huge "nextPollerTime", so the Poller thread can stay blocked for a very long time.
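To make the "nextPollerTime" part easier to follow, here is a heavily simplified, self-contained simulation of the behaviour we believe we are seeing around AprEndpoint.java#L1679. Only the names pollerTime and nextPollerTime are taken from AprEndpoint; the loop below is our own paraphrase (assuming the timeout is bumped by pollerTime on every iteration in which the poll set stays empty), not the real Poller code:

    public class NextPollerTimeSketch {
        public static void main(String[] args) {
            // pollerTime corresponds to the connector's pollTime setting
            // (2000 microseconds by default, if we read the docs correctly).
            final long pollerTime = 2_000;
            long nextPollerTime = pollerTime;

            // Phase 1: sustained "Connection: close" traffic. Every connection is
            // handed straight to a worker, so the poll set stays empty, the poll
            // is skipped, and the timeout is bumped instead of being used.
            long emptyIterations = 100_000; // purely illustrative number
            for (long i = 0; i < emptyIterations; i++) {
                nextPollerTime += pollerTime; // grows and is never reset back to pollerTime
            }

            // Phase 2: a single keep-alive connection is registered with the poller.
            // The next poll would use the accumulated timeout, so the native
            // Poll.poll() call could block the Poller thread for minutes.
            System.out.printf("next poll would block for up to ~%d seconds%n",
                    nextPollerTime / 1_000_000);
        }
    }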
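That also matches the Acceptor stack trace: the Acceptor has to get past the connection-count latch before accepting anything new, and that latch is only counted down when an old connection is closed, which happens on the (stuck) Poller thread. The small runnable sketch below uses a plain java.util.concurrent.Semaphore as a rough stand-in for Tomcat's LimitLatch, just to illustrate the parking behaviour; all names in it are ours, and it is not the actual endpoint code:

    import java.util.concurrent.Semaphore;

    public class ConnectionLimitSketch {
        public static void main(String[] args) throws InterruptedException {
            int maxConnections = 2; // tiny limit, like maxConnections="20" in the repro below
            Semaphore connectionLimit = new Semaphore(maxConnections);

            connectionLimit.acquire(); // connection #1 accepted, never cleaned up
            connectionLimit.acquire(); // connection #2 accepted, never cleaned up

            Thread acceptor = new Thread(() -> {
                try {
                    System.out.println("Acceptor: limit reached, parking...");
                    connectionLimit.acquire(); // parks, like countUpOrAwaitConnection() in the dump
                    System.out.println("Acceptor: slot released, accepting the next connection");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "fake-Acceptor");
            acceptor.start();

            Thread.sleep(1000);
            // In Tomcat this release happens when a connection is closed on the Poller
            // thread; while the Poller is blocked in poll, this step never happens and
            // the Acceptor stays parked, so no new connection is ever accepted.
            System.out.println("Poller: closing an old connection, releasing one slot");
            connectionLimit.release();
            acceptor.join();
        }
    }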
To prove this, we set up a similar environment to reproduce the issue:

1. Use the official Tomcat Docker image (tomcat:8.5.15) to run a simple HTTP server.

2. Change the connector config so that it is easy to hit the connection limit during testing:

   <Connector port="8080"
              protocol="org.apache.coyote.http11.Http11AprProtocol"
              connectionTimeout="20000"
              redirectPort="8443"
              maxConnections="20" />

3. Keep sending lots of non-keep-alive requests for 5 minutes:

   $ wrk2 -t8 -c32 -d10m -R128 -s ./closed.lua http://127.0.0.1:8080/hello?latency=10

4. Send a single keep-alive request and do not close this connection on the client side.

5. After that, send another non-keep-alive request. No response comes back within a reasonable time (we waited 30 seconds).

A workaround: by setting deferAccept="false" on the connector, we can force the Acceptor to register new connections with the poller (AprEndpoint.java#L2250), so "nextPollerTime" no longer grows out of control.

So guys, is this a real issue for Tomcat?

Best Regards,
Chang Liu
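P.S. For completeness, this is roughly what the workaround connector looks like in our test setup. It is simply the reproduction connector from step 2 with deferAccept="false" added, shown here only to illustrate the workaround, not as a verified fix:

   <Connector port="8080"
              protocol="org.apache.coyote.http11.Http11AprProtocol"
              connectionTimeout="20000"
              redirectPort="8443"
              maxConnections="20"
              deferAccept="false" />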