I think the fact that you're getting AMQ219014 explains most, if not all, of the issues. The "packet 71" that is timing out here is the packet used to send a message from the core client to the broker. As far as I can tell this is your "lost" message although it's not really lost since you got a clear exception indicating that sending the message failed. This message will not be resent automatically for the same basic reasons I explained on your other thread [1].
Justin [1] https://lists.apache.org/thread/6q7zoczljslbr26v845fycz4lfdgs3nk On Sun, Mar 19, 2023 at 4:03 PM John Lilley <john.lil...@redpointglobal.com.invalid> wrote: > Greetings! > > > > Our app is heavy on the RPC pattern using the Artemis JMS client API. Our > RPC client sends a request to a named service queue, and waits for a reply > on a reply-to queue. This mostly works great, but under load, and in the > Kubernetes environment, something goes wrong and a message is lost. > > > > I am beginning to suspect that that this is due to using a single reply-to > temporary queue for a long lifetime. It is OK to use a temporary queue for > a long time? If there is a connection error (like I think we see below), > should I be creating a new reply queue? > > > > Apologies in advance because I feel like we keep nibbling around the edge > of the same symptoms and never quite getting enough information to > completely resolve it. Hopefully this time we’ve logged enough detail and > uncovered something new that will open the door to an answer. > > > > We recently logged an issue that seems to come down to: > > - There is an error logged at low levels of the AMQ client > - These errors seem to mostly self-recover but not before a 30-second > timeout. > - Our RPC client sends a request message to a named service queue > - Our RPC service receives the request and sends a reply message to > the (temporary) reply-to queue > - ActiveMQ broker log shows the reply SEND but not the DELIVER > - On examination of the AMQ console, the reply-to queue exists, has > one message, but no consumers > > > > On the face of it this seems simple. We must be closing the reply > consumer or session. But we see evidence that something else is going > wrong at a lower level. > > > > I keep scouring our code for where we might be closing a session or > consumer, and I just can’t find anything, because: > > - We create Connections statically and never close them > - We use a singleton reply-to queue and MessageListener, and receive > all replies through this one consumer > - Our singleton MessageListener creates and owns its Session, > Consumer, and Queue, and never closes them > > > > In the log of our RPC client: > > 2023-03-18T03:02:27.290 [Thread-1 (ActiveMQ-client-global-threads)] > client.connectionFailureDetected:293 [] WARN - AMQ212037: Connection > failure to dm-activemq-svc/10.0.202.122:61616 has been detected: > AMQ219014: Timed out after waiting 30000 ms for response when sending > packet 71 [code=CONNECTION_TIMEDOUT] > > 2023-03-18T03:02:27.294 [Thread-1 (ActiveMQ-client-global-threads)] > ClientSessionFactoryImpl.failoverOrReconnect:555 [] DEBUG - Failure > captured on connectionID=c8929f70, performing failover or reconnection now > ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT > message=AMQ219014: Timed out after waiting 30000 ms for response when > sending packet 71] > > … > > 2023-03-18T03:02:27.466 [Thread-1901] > JmsStaticConnectionPool.onException:78 [] ERROR - Receive error occurred. > javax.jms.JMSException: > ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT > message=AMQ219014: Timed out after waiting 30000 ms for response when > sending packet 71] > > You can see more detail at: > https://drive.google.com/file/d/1-ThbTMQAPAJwNUI4HIlmFKXf-yB9p59E/view?usp=share_link > > I’ve got our code set to log a thread stack dump when this happens, so > there is a lot of detail there. > > > > The JmsStaticConnectionPool.onException() is our registered exception > handler. We don’t try to recover or anything in this method, we just log > the error. *Should we be doing something here? *I’ve never been able to > understand under what circumstances we need to actively recover from errors. > > > > Shortly after that, we attempt another RPC call, which receives no reply: > > 2023-03-18T03:02:28.424 [HpaK8S] JmsRpcClientChannel.prepareCall:81 [] > INFO - REQUEST OUT: { "header": > {"version":1,"type":"void","id":"om8xv1gdjvsb","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"***REDACTED***"}, > "body":{"message_type":"void"} } > > 2023-03-18T03:02:28.427 [HpaK8S] > AbstractProducerCreditsImpl.checkCredits:182 [] DEBUG - not asking for -1 > credits on rpdm.v1.hpa.metrics > > 2023-03-18T03:02:28.427 [HpaK8S] > AbstractProducerCreditsImpl.afterAcquired:90 [] DEBUG - AfterAcquire 679 > credits on address rpdm.v1.hpa.metrics > > > > You can see this is received in our service log and a reply is sent: > > 2023-03-18T03:02:30.292 [Thread-68 (ActiveMQ-client-global-threads)] > JmsServerBase.logRequest:107 [] INFO - REQUEST IN: { "header": > {"version":1,"type":"void","id":"om8xv1gdjvsb","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"***REDACTED***","chain":[]}, > "body":"***ERROR NO MESSAGE BODY***" } > > 2023-03-18T03:02:30.292 [Thread-68 (ActiveMQ-client-global-threads)] > JmsServerBase.logResponse:114 [om8xv1gdjvsb] INFO - RESPONSE OUT: { > "header": > {"version":1,"type":"expected_pod_counts_response","id":"om8xv1gdjvsb","http_code":200}, > "body":{"counts":[{"message_type":"expected_pod_count_response","pod_type":"project_execution","memory_mb":2048,"jvm_memory_mb":250,"count":1},{"message_type":"expected_pod_count_response","pod_type":"project_execution","memory_mb":6144,"jvm_memory_mb":2000,"count":0},{"message_type":"expected_pod_count_response","pod_type":"project_edit","memory_mb":500,"jvm_memory_mb":500,"count":1},{"message_type":"expected_pod_count_response","pod_type":"automation_edit","memory_mb":500,"jvm_memory_mb":500,"count":0},{"message_type":"expected_pod_count_response","pod_type":"automation_execution","memory_mb":500,"jvm_memory_mb":250,"count":0}],"message_type":"expected_pod_counts_response"} > } > > > > In the AMQ broker log the request SEND and DELIVER: > > 2023-03-17T21:02:28-06:00 2023-03-18 03:02:28,427 INFO > [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] SEND: { "header": > {"version":1,"type":"void","id":"om8xv1gdjvsb","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"Bwuth0M2TKk1Znm+iD7+Pew2hp4Klq5+eTjGuWEjfn4="}, > "body":{"message_type":"void"} } > > 2023-03-17T21:02:30-06:00 2023-03-18 03:02:30,183 INFO > [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] DELIVER: { "header": > {"version":1,"type":"void","id":"om8xv1gdjvsb","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"Bwuth0M2TKk1Znm+iD7+Pew2hp4Klq5+eTjGuWEjfn4="}, > "body":{"message_type":"void"} } > > > > But the response only SEND, no DELIVER: > > 2023-03-17T21:02:30-06:00 2023-03-18 03:02:30,305 INFO > [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] SEND: { "header": > {"version":1,"type":"expected_pod_counts_response","id":"om8xv1gdjvsb","http_code":200}, > "body":{"counts":[{"message_type":"expected_pod_count_response","pod_type":"project_execution","memory_mb":2048,"jvm_memory_mb":250,"count":1},{"message_type":"expected_pod_count_response","pod_type":"project_execution","memory_mb":6144,"jvm_memory_mb":2000,"count":0},{"message_type":"expected_pod_count_response","pod_type":"project_edit","memory_mb":500,"jvm_memory_mb":500,"count":1},{"message_type":"expected_pod_count_response","pod_type":"automation_edit","memory_mb":500,"jvm_memory_mb":500,"count":0},{"message_type":"expected_pod_count_response","pod_type":"automation_execution","memory_mb":500,"jvm_memory_mb":250,"count":0}],"message_type":"expected_pod_counts_response"} > } > > > > This is consistent with the reply-to consumer being closed, but I cannot > how or when it is closed. Also in the AMQ console we see the consumer > count is zero and there is one message, so it certainly *acts* like the > reply-to consumer was closed. > > > > I should mention that we see a lot of these in our RPC client log, but > since there so many I assumed they were harmless: > > 2023-03-18T03:02:36.094 [Thread-2] ClientSessionImpl.close:1333 [] DEBUG - > Calling close on session ClientSessionImpl > [name=509b0e84-c539-11ed-911b-86945e7d6721, username=null, closed=false, > factory = > org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl@50f3a373, > metaData=()]@d95cd6d > > 2023-03-18T03:02:36.095 [Thread-2] ClientSessionImpl.doCleanup:2080 [] > DEBUG - calling cleanup on ClientSessionImpl > [name=509b0e84-c539-11ed-911b-86945e7d6721, username=null, closed=false, > factory = > org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl@50f3a373, > metaData=()]@d95cd6d > > > > I looked back 30 seconds in the RPC client log, and didn’t see anything > odd, but there is this unusual flurry: > > 2023-03-18T03:01:57.255 [HpaMetrics-Background] > AbstractProducerCreditsImpl.checkCredits:182 [] DEBUG - not asking for -1 > credits on activemq.management > > 2023-03-18T03:01:57.255 [HpaMetrics-Background] > AbstractProducerCreditsImpl.afterAcquired:90 [] DEBUG - AfterAcquire 413 > credits on address activemq.management > > 2023-03-18T03:01:57.256 [HpaMetrics-Background] > AbstractProducerCreditsImpl.checkCredits:182 [] DEBUG - not asking for -1 > credits on activemq.management > > 2023-03-18T03:01:57.256 [HpaMetrics-Background] > AbstractProducerCreditsImpl.afterAcquired:90 [] DEBUG - AfterAcquire 415 > credits on address activemq.management > > 2023-03-18T03:01:57.257 [HpaMetrics-Background] > AbstractProducerCreditsImpl.checkCredits:182 [] DEBUG - not asking for -1 > credits on activemq.management > > 2023-03-18T03:01:57.257 [HpaMetrics-Background] > AbstractProducerCreditsImpl.afterAcquired:90 [] DEBUG - AfterAcquire 389 > credits on address activemq.management > > 2023-03-18T03:01:57.257 [HpaMetrics-Background] > AbstractProducerCreditsImpl.checkCredits:182 [] DEBUG - not asking for -1 > credits on activemq.management > > 2023-03-18T03:01:57.257 [HpaMetrics-Background] > AbstractProducerCreditsImpl.afterAcquired:90 [] DEBUG - AfterAcquire 395 > credits on address activemq.management > > 2023-03-18T03:01:57.258 [HpaMetrics-Background] > AbstractProducerCreditsImpl.checkCredits:182 [] DEBUG - not asking for -1 > credits on activemq.management > > 2023-03-18T03:01:57.258 [HpaMetrics-Background] > AbstractProducerCreditsImpl.afterAcquired:90 [] DEBUG - AfterAcquire 401 > credits on address activemq.management > > > > Meanwhile, we have a custom logger installed in the AMQ broker, and > everything looks fine until it is not: > > > > [image: rg] <https://www.redpointglobal.com/> > > John Lilley > > Data Management Chief Architect, Redpoint Global Inc. > > 888 Worcester Street, Suite 200 Wellesley, MA 02482 > > *M: *+1 7209385761 <+1%207209385761> | john.lil...@redpointglobal.com > > PLEASE NOTE: This e-mail from Redpoint Global Inc. (“Redpoint”) is > confidential and is intended solely for the use of the individual(s) to > whom it is addressed. If you believe you received this e-mail in error, > please notify the sender immediately, delete the e-mail from your computer > and do not copy, print or disclose it to anyone else. If you properly > received this e-mail as a customer, partner or vendor of Redpoint, you > should maintain its contents in confidence subject to the terms and > conditions of your agreement(s) with Redpoint. >