All,

I've got 3 web servers in AWS EC2 connected to a pair of back-end Tomcat instances using mod_jk. I'm also using stunnel to wrap the AJP connections in TLS. So the connections look like this:

ALB -> web server -> [AJP over stunnel] -> Tomcat AjpNioProtocol

One of these web servers is newly built and seems to be suffering from connection errors. This is a load-balancer connection using sticky sessions to the back-end nodes. The jk-status page shows a small number of errors, e.g.:

Name    Act     State   Err
node1   ACT     OK      5
node2   ACT     OK      6

Note: these are NOT "client errors". I pretty much always ignore those.

This number is small because I reset the balancer member stats this morning to get a better handle on how often the errors occur. It's not all that often, but the other two web servers are pretty much never registering ANY errors. So this is definitely a problem I'd like to solve.

Some notable differences between the existing web servers and the new one:

1. Old web servers are x86-64 based, the new one is aarch64
2. New web server goes through an AWS NAT gateway for IPv4

The new web server has only a public IPv6 address and uses a NAT gateway to reach the internet over IPv4. The other two web servers have public IPv4 addresses. Neither back-end server has IPv6, so all AJP/stunnel communication is over IPv4.

The mod_jk log contains entries like these:

[Thu Oct 30 13:46:15.893 2025] [aQNsJ08YUCb0jK7JrtZe3gAAAAA] [1447:281472310505664] [info] ajp_connection_tcp_get_message::jk_ajp_common.c (1376): (node1) can't receive the response header message from tomcat, network problems or tomcat (127.0.0.1:7015) is down (errno=104)

[Thu Oct 30 13:46:15.893 2025] [aQNsJ08YUCb0jK7JrtZe3gAAAAA] [1447:281472310505664] [error] ajp_get_reply::jk_ajp_common.c (2346): (node1) Tomcat is down or refused connection. No response has been sent to the client (yet)

[Thu Oct 30 13:46:15.893 2025] [aQNsJ08YUCb0jK7JrtZe3gAAAAA] [1447:281472310505664] [info] ajp_service::jk_ajp_common.c (2892): (node1) sending request to tomcat failed (recoverable), (attempt=1)

[Thu Oct 30 13:46:15.995 2025] [aQNsJ08YUCb0jK7JrtZe3gAAAAA] [1447:281472310505664] [info] ajp_connection_tcp_get_message::jk_ajp_common.c (1376): (node1) can't receive the response header message from tomcat, network problems or tomcat (127.0.0.1:7015) is down (errno=104)

[Thu Oct 30 13:46:15.995 2025] [aQNsJ08YUCb0jK7JrtZe3gAAAAA] [1447:281472310505664] [error] ajp_get_reply::jk_ajp_common.c (2346): (node1) Tomcat is down or refused connection. No response has been sent to the client (yet)

[Thu Oct 30 13:46:15.995 2025] [aQNsJ08YUCb0jK7JrtZe3gAAAAA] [1447:281472310505664] [info] ajp_service::jk_ajp_common.c (2892): (node1) sending request to tomcat failed (recoverable), (attempt=2)

[Thu Oct 30 13:46:15.995 2025] [aQNsJ08YUCb0jK7JrtZe3gAAAAA] [1447:281472310505664] [error] ajp_service::jk_ajp_common.c (2913): (node1) connecting to tomcat failed (rc=0, errors=6, client_errors=2).

[Thu Oct 30 13:46:15.997 2025] [aQNsJ08YUCb0jK7JrtZe3gAAAAA] [1447:281472310505664] [info] service::jk_lb_worker.c (1602): service failed, worker node1 is in local error state

127.0.0.1:7015 is the port number where stunnel is listening.

The stunnel log contains entries like these:

Oct 30 13:46:15 ip-10-2-0-166.ec2.internal stunnel[1444]: LOG3[423]: SSL_read: ssl/record/rec_layer_s3.c:689: error:0A000126:SSL routines::unexpected eof while reading

Oct 30 13:46:15 ip-10-2-0-166.ec2.internal stunnel[1444]: LOG3[373]: SSL_read: ssl/record/rec_layer_s3.c:689: error:0A000126:SSL routines::unexpected eof while reading

Oct 30 13:46:15 ip-10-2-0-166.ec2.internal stunnel[1444]: LOG3[375]: SSL_read: ssl/record/rec_layer_s3.c:689: error:0A000126:SSL routines::unexpected eof while reading

Oct 30 13:46:15 ip-10-2-0-166.ec2.internal stunnel[1444]: LOG3[374]: SSL_read: ssl/record/rec_layer_s3.c:689: error:0A000126:SSL routines::unexpected eof while reading

Given that the timestamps correlate, these appear to be reporting the same event.

When stunnel reports "unexpected eof", it typically means that the remote server (or some network gear in between) closed the TCP connection without tearing down the TLS session cleanly (no close_notify).
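For what it's worth, stunnel can also be told to keep these tunnels alive at the TCP level. A sketch of what that might look like in the client-side service section (the service name, remote host, and ports here are assumptions, not my actual config):

```ini
; stunnel.conf (client side) -- hypothetical service name, host, and ports
[ajp-node1]
client = yes
accept = 127.0.0.1:7015
connect = tomcat-host:7016
; enable TCP keepalives on both the local (l:) and remote (r:) sockets
socket = l:SO_KEEPALIVE=1
socket = r:SO_KEEPALIVE=1
```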

That symptom, plus the fact that this is the only server using a NAT gateway, points to one obvious culprit: the NAT gateway is killing idle connections and surprising both stunnel and mod_jk. I can also see a graph of non-zero "Idle Timeouts" on the NAT gateway. It doesn't tell me which connections timed out, but they are almost certainly outgoing AJP/stunnel connections.
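For anyone following along, the graph I'm referring to is CloudWatch's IdleTimeoutCount metric for the NAT gateway; something like this pulls the raw numbers (the gateway ID and time range are placeholders):

```shell
# Connections the NAT gateway dropped for idleness, summed per 5-minute bucket.
# nat-0123456789abcdef0 is a placeholder; substitute the real gateway ID.
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name IdleTimeoutCount \
  --dimensions Name=NatGatewayId,Value=nat-0123456789abcdef0 \
  --statistics Sum \
  --period 300 \
  --start-time 2025-10-30T00:00:00Z \
  --end-time 2025-10-31T00:00:00Z
```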

But.

Here is my mod_jk workers configuration:

# Template worker
worker.template.type=ajp13
worker.template.host=localhost
worker.template.connection_pool_timeout=60
worker.template.socket_timeout=300
worker.template.max_packet_size=65536

worker.node1.reference=worker.template
worker.node1.port=7015
worker.node1.route=node1

My expectation is that connection_pool_timeout=60 (seconds) will close connections that have been idle for 60 seconds. If mod_jk closes a connection, stunnel will close the corresponding connection as well. (Note: I have no explicit connectionTimeout or keepAliveTimeout on the Tomcat side, but that doesn't seem to be a problem for the other two web servers.)
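For reference, those implicit Tomcat-side defaults could be pinned down explicitly. A sketch of the AJP connector (the address, port, and timeout values here are illustrative assumptions, not my actual server.xml; timeouts are in milliseconds):

```xml
<!-- server.xml: AJP connector with explicit idle timeouts (illustrative values) -->
<Connector protocol="org.apache.coyote.ajp.AjpNioProtocol"
           address="127.0.0.1" port="7016"
           connectionTimeout="600000"
           keepAliveTimeout="600000" />
```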

Checking the NAT gateway's configuration, it has a fixed idle timeout of 350 seconds, which is much longer than the 60 seconds I (believe I) have set for idle AJP connections.
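If the NAT gateway does turn out to be the culprit despite the timeout math, the usual belt-and-braces options on the mod_jk side would look something like this (a sketch extending my template worker; ping_mode requires mod_jk 1.2.27 or later):

```properties
# Keep pooled connections from idling silently past the NAT gateway's limit
worker.template.socket_keepalive=true
# Probe connections with CPing/CPong: "A" enables probes at connect time,
# before each request, and during interval maintenance
worker.template.ping_mode=A
worker.template.ping_timeout=10000
```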

I do not use servlet async or WebSocket anywhere in my application, so I do not expect long-lived connections between client and server.

Is there anything I haven't checked at this point?

-chris


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
