Hi,

Just a follow-up on this topic - has anybody faced a similar issue with the federation mechanism ending in an OOM when there are intermittent TCP connection problems?
Is there anything we can do to maintain the stability of our installation?

Thanks,
Michal

From: Michal Balicki
Sent: Tuesday, 3 January 2023 14:43
To: users@activemq.apache.org
Subject: RE: Federation queues - issue with cleaning up resources?

It looks like the images were not correctly attached to the original message - attaching them now.

Thanks,
Michal Balicki

From: Michal Balicki
Sent: Tuesday, 3 January 2023 10:22
To: users@activemq.apache.org
Subject: Federation queues - issue with cleaning up resources?

Hi,

In our installation we use Artemis 2.27.1 embedded in Spring Boot 2.7.6. Recently two separate Artemis clusters in different DCs were joined using federated queues (both downstream and upstream mode). Since then we have been observing heap memory spikes from time to time. It looks like this could be related to improper handling of dead TCP connections caused by network issues.

The following is a snippet of the queue federation configuration on the nodes where federation is set up:

var federationUpstreamConfiguration = new FederationUpstreamConfiguration();
federationUpstreamConfiguration.setName(String.format("federation-upstream-config-for-%s", federationName));
federationUpstreamConfiguration.getConnectionConfiguration()
    .setShareConnection(true)
    .setStaticConnectors(Collections.singletonList(connectorName))
    .setCircuitBreakerTimeout(5000)
    .setHA(false)
    .setClientFailureCheckPeriod(ActiveMQDefaultConfiguration.getDefaultFederationFailureCheckPeriod())
    .setConnectionTTL(ActiveMQDefaultConfiguration.getDefaultFederationConnectionTtl())
    .setRetryInterval(ActiveMQDefaultConfiguration.getDefaultFederationRetryInterval())
    .setRetryIntervalMultiplier(ActiveMQDefaultConfiguration.getDefaultFederationRetryIntervalMultiplier())
    .setMaxRetryInterval(ActiveMQDefaultConfiguration.getDefaultFederationMaxRetryInterval())
    .setInitialConnectAttempts(ActiveMQDefaultConfiguration.getDefaultFederationInitialConnectAttempts())
    .setReconnectAttempts(ActiveMQDefaultConfiguration.getDefaultFederationReconnectAttempts())
    .setCallTimeout(ActiveMQClient.DEFAULT_CALL_TIMEOUT)
    .setCallFailoverTimeout(ActiveMQClient.DEFAULT_CALL_FAILOVER_TIMEOUT);
federationUpstreamConfiguration.addPolicyRef(queuePolicyNameUpstream);

This is what we observe in the logs:

2023-01-02T16:46:17.212+01:00 2023-01-02 15:46:17.212 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:38732 has been detected: AMQ229014: Did not receive data from /10.112.62.33:38732 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:47:19.215+01:00 2023-01-02 15:47:19.215 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:57708 has been detected: AMQ229014: Did not receive data from /10.112.62.33:57708 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:48:21.217+01:00 2023-01-02 15:48:21.217 [104] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:56402 has been detected: AMQ229014: Did not receive data from /10.112.62.33:56402 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:49:23.220+01:00 2023-01-02 15:49:23.220 [2466] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:56760 has been detected: AMQ229014: Did not receive data from /10.112.62.33:56760 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:50:25.222+01:00 2023-01-02 15:50:25.222 [110] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:43358 has been detected: AMQ229014: Did not receive data from /10.112.62.33:43358 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:51:27.224+01:00 2023-01-02 15:51:27.224 [110] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:34982 has been detected: AMQ229014: Did not receive data from /10.112.62.33:34982 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:52:29.227+01:00 2023-01-02 15:52:29.227 [97] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:40942 has been detected: AMQ229014: Did not receive data from /10.112.62.33:40942 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:53:31.229+01:00 2023-01-02 15:53:31.229 [92] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:37634 has been detected: AMQ229014: Did not receive data from /10.112.62.33:37634 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:54:33.232+01:00 2023-01-02 15:54:33.231 [3397] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:58302 has been detected: AMQ229014: Did not receive data from /10.112.62.33:58302 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:55:35.234+01:00 2023-01-02 15:55:35.234 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:59364 has been detected: AMQ229014: Did not receive data from /10.112.62.33:59364 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:56:37.236+01:00 2023-01-02 15:56:37.236 [97] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:40052 has been detected: AMQ229014: Did not receive data from /10.112.62.33:40052 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:57:39.239+01:00 2023-01-02 15:57:39.239 [105] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:38354 has been detected: AMQ229014: Did not receive data from /10.112.62.33:38354 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:58:41.241+01:00 2023-01-02 15:58:41.241 [98] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:42590 has been detected: AMQ229014: Did not receive data from /10.112.62.33:42590 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:59:43.243+01:00 2023-01-02 15:59:43.243 [88] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:35532 has been detected: AMQ229014: Did not receive data from /10.112.62.33:35532 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]

What is strange is the consumer count on the upstream nodes, which is often far above any reasonable value. The following is a scrape of the artemis_consumer_count metric on a federated queue:

[image001.png: artemis_consumer_count on the federated queue over time]

You can see strange jumps from 0 to hundreds, with a cleanup only some time later - as if new consumers were being created while the old ones were never removed.
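(Side note on the TTL warnings above: the same connection configuration exposes the failure-check period and connection TTL explicitly, so one experiment we could try on the flaky link is to tighten them instead of taking the federation defaults. A rough, untested sketch only - the 10 000 / 20 000 ms values are purely illustrative, and connectorName / queuePolicyNameUpstream are the same variables as in the snippet above:

var tunedUpstreamConfiguration = new FederationUpstreamConfiguration();
tunedUpstreamConfiguration.setName(String.format("federation-upstream-config-for-%s", federationName));
tunedUpstreamConfiguration.getConnectionConfiguration()
    .setShareConnection(true)
    .setStaticConnectors(Collections.singletonList(connectorName))
    .setCircuitBreakerTimeout(5000)
    .setHA(false)
    .setClientFailureCheckPeriod(10_000) // illustrative, instead of getDefaultFederationFailureCheckPeriod()
    .setConnectionTTL(20_000)            // illustrative, instead of the 60000ms TTL seen in the warnings above
    .setRetryInterval(ActiveMQDefaultConfiguration.getDefaultFederationRetryInterval())
    .setReconnectAttempts(ActiveMQDefaultConfiguration.getDefaultFederationReconnectAttempts());
tunedUpstreamConfiguration.addPolicyRef(queuePolicyNameUpstream);

Whether that would actually avoid the resource build-up described below is exactly what we are unsure about.)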
Similarly, from time to time we observe an enormous number of sessions being maintained on the upstream nodes - e.g. 820 on a single connection:

[image002.png: session count per connection]

When drilling in, you can see that the majority of the sessions on this connection were created at the same time:

[image003.png: sessions on the suspect connection, all created at the same time]

The only way to get rid of them then is to close the connection manually from the console.

Heap usage:

[image004.png: heap usage over time]

When analysing a memory dump with MAT, the following is reported:

Problem Suspect 1

One instance of "org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl" loaded by "org.springframework.boot.loader.LaunchedURLClassLoader @ 0xb02ba788" occupies 65 125 032 (34,91%) bytes. The memory is accumulated in one instance of "java.util.HashMap$Node[]", loaded by "<system class loader>", which occupies 65 119 184 (34,90%) bytes.

Keywords
org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl
org.springframework.boot.loader.LaunchedURLClassLoader @ 0xb02ba788
java.util.HashMap$Node[]

(columns: shallow heap / retained heap, in bytes)

java.util.HashMap$Node[256] @ 0xb6cfca48   1 040   65 119 184
\table java.util.HashMap @ 0xb1bc46c8   48   65 119 248
.\map java.util.HashSet @ 0xb1bc46b8   16   65 119 264
..\factories org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl @ 0xb1bc4620   136   65 125 032
...+serverLocator org.apache.activemq.artemis.core.server.federation.FederationConnection @ 0xb1bc45f0   48   48
...|\connection org.apache.activemq.artemis.core.server.federation.FederationUpstream @ 0xb1bc4530   48   480
...|.\upstream org.apache.activemq.artemis.core.server.federation.queue.FederatedQueue @ 0xb1bc43a0   56   7 392
...|..\[2] java.lang.Object[4] @ 0xb16fe5b8   32   48
...|...\array java.util.concurrent.CopyOnWriteArrayList @ 0xb16fe590   24   88
...|....\brokerPlugins org.apache.activemq.artemis.core.config.impl.FileConfiguration @ 0xb16f7460   592   7 272
...|.....\configuration org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl @ 0xb16f7208   280   4 168
...|......+server org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl @ 0xb3387f60   96   1 952
...|......|\this$0 org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread @ 0xb33ce5f8 activemq-failure-check-thread Thread   128   336
...|......+server org.apache.activemq.artemis.core.server.impl.ServerStatus @ 0xb1bc8d38 >   24   1 184
...|......\Total: 2 entries
...+serverLocator org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xd090a7c8 >

(columns: shallow heap / retained heap / percentage of heap)

org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl @ 0xb1bc4620   136   65 125 032   34,91%
\java.util.HashSet @ 0xb1bc46b8   16   65 119 264   34,90%
.\java.util.HashMap @ 0xb1bc46c8   48   65 119 248   34,90%
..\java.util.HashMap$Node[256] @ 0xb6cfca48   1 040   65 119 184   34,90%
...+java.util.HashMap$Node @ 0xb59106b8   32   236 464   0,13%
...+java.util.HashMap$Node @ 0xb620b438   32   181 944   0,10%
...+java.util.HashMap$Node @ 0xb3e5c6a0   32   169 344   0,09%
...+java.util.HashMap$Node @ 0xb3bb0cd8   32   147 120   0,08%
...+java.util.HashMap$Node @ 0xb5169f88   32   139 184   0,07%
...+java.util.HashMap$Node @ 0xb4456400   32   131 296   0,07%
...+java.util.HashMap$Node @ 0xb59e6668   32   126 616   0,07%
...+java.util.HashMap$Node @ 0xb58818a0   32   121 504   0,07%
...+java.util.HashMap$Node @ 0xb5839f40   32   120 560   0,06%
...+java.util.HashMap$Node @ 0xb84fd158   32   120 496   0,06%
...+java.util.HashMap$Node @ 0xb58104a8   32   118 000   0,06%
...+java.util.HashMap$Node @ 0xb5947bd8   32   115 920   0,06%
...+java.util.HashMap$Node @ 0xb84e3be8   32   115 024   0,06%
...+java.util.HashMap$Node @ 0xb8488828   32   113 400   0,06%
...+java.util.HashMap$Node @ 0xb3bc5258   32   110 704   0,06%
...+java.util.HashMap$Node @ 0xb3e2cdc8   32   110 208   0,06%
...+java.util.HashMap$Node @ 0xb57bda48   32   109 688   0,06%
...+java.util.HashMap$Node @ 0xb57d2880   32   109 232   0,06%
...+org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xb62518c8   192   106 944   0,06%
...+org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xd0b00a40   192   98 336   0,05%
...\Total: 20 entries   960   2 601 984   1,39%

When the federation configuration is removed on the affected node, memory consumption on that node goes back to normal.

Thanks,
Michal Balicki
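PS: For context, the manual workaround mentioned above (closing the suspect connection from the console) could presumably also be scripted against the embedded broker. The class below is only a hypothetical sketch - the name, the server reference and the 500-session threshold are made up for illustration and we have not run it in production; it simply walks the broker's connections and closes any that hold an unreasonable number of sessions via the standard closeConnectionWithID management operation:

import org.apache.activemq.artemis.core.server.ActiveMQServer;
import org.apache.activemq.artemis.spi.core.protocol.RemotingConnection;

// Hypothetical helper, illustration only: automates the "close the connection
// from the console" workaround for connections that pile up sessions.
public class StaleFederationConnectionReaper {

    private static final int SESSION_THRESHOLD = 500; // illustrative value

    private final ActiveMQServer server; // the embedded broker instance

    public StaleFederationConnectionReaper(ActiveMQServer server) {
        this.server = server;
    }

    public void reap() throws Exception {
        for (RemotingConnection connection : server.getRemotingService().getConnections()) {
            String connectionId = connection.getID().toString();
            // Count the sessions currently attached to this connection.
            int sessionCount = server.getSessions(connectionId).size();
            if (sessionCount > SESSION_THRESHOLD) {
                // Close the connection through the management API,
                // as we currently do by hand from the console.
                server.getActiveMQServerControl().closeConnectionWithID(connectionId);
            }
        }
    }
}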