I would recommend a couple of things to narrow down the issue.

First, have a script or something on the client gather thread dumps at regular intervals (e.g. every 5 seconds), starting right before you kill the pod and continuing until the client fails over and recovers.
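For example, the periodic capture could be done in-process with a sketch like the one below (the class name and interval are illustrative, not part of any Artemis API; an external loop around jstack or kubectl exec works just as well):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/**
 * Periodic in-process thread-dump capture: a rough sketch of the
 * "gather thread dumps every few seconds" suggestion above.
 */
public class ThreadDumpSampler {
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    /** Formats one dump of every live thread, with full stacks. */
    public String dumpOnce() {
        StringBuilder sb = new StringBuilder();
        sb.append("=== thread dump @ ").append(System.currentTimeMillis()).append(" ===\n");
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    /** Prints a dump every intervalMillis until interrupted. */
    public void run(long intervalMillis) throws InterruptedException {
        while (true) {
            System.out.println(dumpOnce());
            Thread.sleep(intervalMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        new ThreadDumpSampler().run(5_000); // every 5 seconds, as suggested
    }
}
```

Timestamped dumps like these let you see exactly which thread is blocked (and where) during the 40-second gap.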
Second, turn on TRACE logging for org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl. This will print details about every packet the client sends. You can correlate those log statements with your "REQUEST OUT" statements to see whether there is a delay between when you think you're sending a packet and when the packet is actually sent.

Generally speaking, I'd also recommend using a clientFailureCheckPeriod that is smaller than your connectionTTL. You are currently using 10000 for both, which may result in the broker closing your connection unnecessarily. I recommend clientFailureCheckPeriod=5000 instead.


Justin

On Wed, Feb 21, 2024 at 12:45 PM John Lilley <[email protected]> wrote:

> Greetings!
>
> We are having a devil of a time trying to reduce delay during a failover
> event. We’ve set our URL to
>
> (tcp://dm-activemq-live-svc:61616,tcp://dm-activemq-backup-svc:61617)?ha=true&reconnectAttempts=200&initialConnectAttempts=200&clientFailureCheckPeriod=10000&connectionTTL=10000&callTimeout=10000
>
> But nothing seems to reduce the time it takes for a sender to detect the
> issue and fail over. With or without these parameters, it takes a full
> minute to recover after failover. Are there other parameters to adjust?
> Something in the broker? Is there some internal retry loop or timer that
> waits longer that we can influence?
>
> We are testing failover by killing one of the AMQ pods. This almost
> always succeeds without issue, except occasionally it doesn’t.
>
> The backup broker log shows that it assumes control very quickly:
>
> 2024-02-16 22:42:56,523 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
> 2024-02-16 22:42:56,533 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started EPOLL Acceptor at 0.0.0.0:61617 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
> 2024-02-16 22:42:58,074 INFO [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] SEND: HEADER={"version":1,"type":"void","id":"jagbjffeu3of","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"Q979k6lzOu9KmqMA89GbvmZnIAMNpJJP/TEAAa7Yjpo="}, BODY={"message_type":"void"}
> 2024-02-16 22:42:58,254 INFO [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] DELIVER: HEADER={"version":1,"type":"void","id":"jagbjffeu3of","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"Q979k6lzOu9KmqMA89GbvmZnIAMNpJJP/TEAAa7Yjpo="}, BODY={"message_type":"void"}
>
> In our app logs, we show that we are attempting to send a message to the
> AMQ broker at the time of failover:
>
> 2024-02-16T22:42:55.981 [http-nio-9910-exec-9] JmsRpcClientChannel.prepareCall:84 [9v2zwvclclrc] INFO - REQUEST OUT: { "header": {"version":1,"type":"get_task_status_request","id":"9v2zwvclclrc","api":"test_harness","method":"get_task_status","instance":"combined","authorization":"***REDACTED***"}, "body":{"id":"d1123865-ac47-4238-ab02-5f2324a43264","progress_start_index":0,"message_type":"get_task_status_request"} }
>
> But *then it takes 40 seconds for the “10000ms” timeout to happen.*
>
> 2024-02-16T22:43:35.988 [http-nio-9910-exec-9] JmsProducerPool.send_:376 [9v2zwvclclrc] WARN - Error sending message, will retry
> javax.jms.JMSException: AMQ219014: Timed out after waiting 10000 ms for response when sending packet 71
>
> Our problem is really that this 40-second delay seems to push everything
> back so that recovery takes over a minute.
> Once we get into the one-minute mark, we start hitting several timeouts,
> like our own RPC timeout setting and the nginx ingress controller for K8S.
> How can we know why this takes so long? We’ve set every timeout we can
> find for the AMQ client to 10 seconds. Is there some other setting we
> need to adjust?
>
> Meanwhile the primary broker pod has returned, and the backup decides it
> is not needed:
>
> 2024-02-16 22:43:16,115 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.31.2 [10952195-b6ec-11ee-9c87-aa03cb64206a] stopped, uptime 16 minutes
> 2024-02-16 22:43:16,115 INFO [org.apache.activemq.artemis.core.server] AMQ221039: Restarting as Replicating backup server after live restart
>
> But our app is waiting for its response on the reply-to queue, and gets
> this error nearly a minute later:
>
> 2024-02-16T22:43:58.033 [Thread-6] JmsStaticConnectionPool.onException:78 [] ERROR - Receive error occurred.
> javax.jms.JMSException: ActiveMQDisconnectedException[errorType=DISCONNECTED message=AMQ219015: The connection was disconnected because of server shutdown]
>
> This is the stack trace at time of the first timeout:
>
> at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:550)
> at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:452)
> at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sendFullMessage(ActiveMQSessionContext.java:588)
> at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.sendRegularMessage(ClientProducerImpl.java:305)
> at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.doSend(ClientProducerImpl.java:277)
> at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:147)
> at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:129)
> at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:483)
> at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:221)
> at net.redpoint.ipc.jms.JmsProducerPool.send_(JmsProducerPool.java:372)
> at net.redpoint.ipc.jms.JmsProducerPool.sendRequest(JmsProducerPool.java:301)
> at net.redpoint.ipc.jms.JmsRpcClientChannel.sendRequest(JmsRpcClientChannel.java:228)
> at net.redpoint.ipc.jms.JmsRpcClientChannel.invokeRaw(JmsRpcClientChannel.java:202)
> at net.redpoint.ipc.jms.JmsRpcClientChannel.call(JmsRpcClientChannel.java:101)
> at net.redpoint.ipc.clients.RpcClientBase._sync(RpcClientBase.java:169)
> at net.redpoint.ipc.clients.RpcClientBase._rpc(RpcClientBase.java:237)
> at net.redpoint.rpdm.ipc.clients.TestHarnessClient.getTaskStatus(TestHarnessClient.java:229)
> at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.lambda$getTaskStatus$25(TestHarnessHttpServer.java:390)
> at net.redpoint.ipc.SecurityControl.doAsNoThrow(SecurityControl.java:272)
> at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.lambda$getTaskStatus$26(TestHarnessHttpServer.java:390)
> at net.redpoint.ipc.InstanceControl.doAsNoThrow(InstanceControl.java:84)
> at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.getTaskStatus(TestHarnessHttpServer.java:390)
> at jdk.internal.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
> at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146)
> at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189)
> at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219)
> at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93)
> at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
> at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
> at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
> at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261)
> at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
> at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
> at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
> at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
> at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
> at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
> at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240)
> at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697)
> at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
> at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
> at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:357)
> at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
> at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:205)
> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:167)
> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90)
> at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:482)
> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:115)
> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93)
> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
> at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:340)
> at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:391)
> at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
> at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:896)
> at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1744)
> at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
> at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
> at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
> at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
> at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219014: Timed out after waiting 10000 ms for response when sending packet 71]
> ... 65 more
>
> This is the stack trace at the second timeout message:
>
> at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:714)
> at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:735)
> at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:868)
> at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:856)
> at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:802)
> at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:566)
> at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1407)
> at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:98)
> at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:212)
> at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$CloseRunnable.run(ClientSessionFactoryImpl.java:1172)
> at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:57)
> at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:32)
> at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:68)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: ActiveMQDisconnectedException[errorType=DISCONNECTED message=AMQ219015: The connection was disconnected because of server shutdown]
> ... 7 more
>
> John Lilley
> Data Management Chief Architect, Redpoint Global Inc.
> 34 Washington Street, Suite 205 Wellesley Hills, MA 02481
> M: +1 7209385761 | [email protected]
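[Editor's note: the TRACE logging suggested at the top of the thread can be enabled on the client with a logger entry along these lines. This is a Log4j2 properties sketch; the logger names follow Log4j2's configuration syntax, and you would adapt it for Logback, JUL, or whatever logging backend the application actually uses.]

```properties
# Emit TRACE-level packet detail from the Artemis core client channel
logger.artemisChannel.name = org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl
logger.artemisChannel.level = TRACE
```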
