Hi,

Just a follow-up on this topic - has anybody faced a similar issue with the federation mechanism ending in an OOM when there are intermittent TCP connection problems?
Is there anything we can do to maintain the stability of our installation?

Thanks,
Michal

From: Michal Balicki
Sent: Tuesday, 3 January 2023 14:43
To: users@activemq.apache.org
Subject: RE: Federation queues - issue with cleaning up resources?

It looks like the images were not correctly attached to the original message - attaching them now.

Thanks,
Michal Balicki

From: Michal Balicki
Sent: Tuesday, 3 January 2023 10:22
To: users@activemq.apache.org
Subject: Federation queues - issue with cleaning up resources?

Hi,

In our installation we use Artemis 2.27.1 embedded in Spring Boot 2.7.6. Recently two separate Artemis clusters in different DCs were joined using federated queues (both downstream and upstream mode). Since then we have been observing heap memory spikes from time to time. It looks like this could be related to improper handling of dead TCP connections caused by network issues.

The following is a snippet of the queue federation configuration on the nodes where federation is set up:

var federationUpstreamConfiguration = new FederationUpstreamConfiguration();
federationUpstreamConfiguration.setName(String.format("federation-upstream-config-for-%s", federationName));
federationUpstreamConfiguration.getConnectionConfiguration()
    .setShareConnection(true)
    .setStaticConnectors(Collections.singletonList(connectorName))
    .setCircuitBreakerTimeout(5000)
    .setHA(false)
    .setClientFailureCheckPeriod(ActiveMQDefaultConfiguration.getDefaultFederationFailureCheckPeriod())
    .setConnectionTTL(ActiveMQDefaultConfiguration.getDefaultFederationConnectionTtl())
    .setRetryInterval(ActiveMQDefaultConfiguration.getDefaultFederationRetryInterval())
    .setRetryIntervalMultiplier(ActiveMQDefaultConfiguration.getDefaultFederationRetryIntervalMultiplier())
    .setMaxRetryInterval(ActiveMQDefaultConfiguration.getDefaultFederationMaxRetryInterval())
    .setInitialConnectAttempts(ActiveMQDefaultConfiguration.getDefaultFederationInitialConnectAttempts())
    .setReconnectAttempts(ActiveMQDefaultConfiguration.getDefaultFederationReconnectAttempts())
    .setCallTimeout(ActiveMQClient.DEFAULT_CALL_TIMEOUT)
    .setCallFailoverTimeout(ActiveMQClient.DEFAULT_CALL_FAILOVER_TIMEOUT);
federationUpstreamConfiguration.addPolicyRef(queuePolicyNameUpstream);

This is what we observe in the logs:

2023-01-02T16:46:17.212+01:00 2023-01-02 15:46:17.212 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:38732 has been detected: AMQ229014: Did not receive data from /10.112.62.33:38732 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:47:19.215+01:00 2023-01-02 15:47:19.215 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:57708 has been detected: AMQ229014: Did not receive data from /10.112.62.33:57708 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:48:21.217+01:00 2023-01-02 15:48:21.217 [104] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:56402 has been detected: AMQ229014: Did not receive data from /10.112.62.33:56402 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:49:23.220+01:00 2023-01-02 15:49:23.220 [2466] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:56760 has been detected: AMQ229014: Did not receive data from /10.112.62.33:56760 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:50:25.222+01:00 2023-01-02 15:50:25.222 [110] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:43358 has been detected: AMQ229014: Did not receive data from /10.112.62.33:43358 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:51:27.224+01:00 2023-01-02 15:51:27.224 [110] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:34982 has been detected: AMQ229014: Did not receive data from /10.112.62.33:34982 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:52:29.227+01:00 2023-01-02 15:52:29.227 [97] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:40942 has been detected: AMQ229014: Did not receive data from /10.112.62.33:40942 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:53:31.229+01:00 2023-01-02 15:53:31.229 [92] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:37634 has been detected: AMQ229014: Did not receive data from /10.112.62.33:37634 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:54:33.232+01:00 2023-01-02 15:54:33.231 [3397] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:58302 has been detected: AMQ229014: Did not receive data from /10.112.62.33:58302 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:55:35.234+01:00 2023-01-02 15:55:35.234 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:59364 has been detected: AMQ229014: Did not receive data from /10.112.62.33:59364 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:56:37.236+01:00 2023-01-02 15:56:37.236 [97] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:40052 has been detected: AMQ229014: Did not receive data from /10.112.62.33:40052 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:57:39.239+01:00 2023-01-02 15:57:39.239 [105] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:38354 has been detected: AMQ229014: Did not receive data from /10.112.62.33:38354 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:58:41.241+01:00 2023-01-02 15:58:41.241 [98] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:42590 has been detected: AMQ229014: Did not receive data from /10.112.62.33:42590 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2023-01-02T16:59:43.243+01:00 2023-01-02 15:59:43.243 [88] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:35532 has been detected: AMQ229014: Did not receive data from /10.112.62.33:35532 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]

What is strange is the consumer count on the upstream nodes, which is often far above any reasonable value. The following is a scrape of the artemis_consumer_count metric on a federated queue:

[image001.png: artemis_consumer_count on the federated queue over time]

You can see strange jumps from 0 to hundreds, with a cleanup only some time later - as if new consumers were being created while the old ones were never removed.
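(Side note on the TTL warnings above: the same connection configuration exposes the failure-check period and connection TTL explicitly, so one experiment we could try on the flaky link is to tighten them instead of taking the federation defaults. A rough, untested sketch only - the 10 000 / 20 000 ms values are purely illustrative, and connectorName / queuePolicyNameUpstream are the same variables as in the snippet above:

var tunedUpstreamConfiguration = new FederationUpstreamConfiguration();
tunedUpstreamConfiguration.setName(String.format("federation-upstream-config-for-%s", federationName));
tunedUpstreamConfiguration.getConnectionConfiguration()
    .setShareConnection(true)
    .setStaticConnectors(Collections.singletonList(connectorName))
    .setCircuitBreakerTimeout(5000)
    .setHA(false)
    .setClientFailureCheckPeriod(10_000) // illustrative, instead of getDefaultFederationFailureCheckPeriod()
    .setConnectionTTL(20_000)            // illustrative, instead of the 60000ms TTL seen in the warnings above
    .setRetryInterval(ActiveMQDefaultConfiguration.getDefaultFederationRetryInterval())
    .setReconnectAttempts(ActiveMQDefaultConfiguration.getDefaultFederationReconnectAttempts());
tunedUpstreamConfiguration.addPolicyRef(queuePolicyNameUpstream);

Whether that would actually avoid the resource build-up described below is exactly what we are unsure about.)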
Similarly, from time to time we observe an enormous number of sessions being maintained on the upstream nodes - e.g. 820 on a single connection:

[image002.png: session count per connection]

When drilling in, you can see that the majority of the sessions on this connection were created at the same time:

[image003.png: sessions on the suspect connection, all created at the same time]

The only way to get rid of them then is to close the connection manually from the console.

Heap usage:

[image004.png: heap usage over time]

When analysing a memory dump with MAT, the following is reported:

Problem Suspect 1

One instance of "org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl" loaded by "org.springframework.boot.loader.LaunchedURLClassLoader @ 0xb02ba788" occupies 65 125 032 (34,91%) bytes. The memory is accumulated in one instance of "java.util.HashMap$Node[]", loaded by "<system class loader>", which occupies 65 119 184 (34,90%) bytes.

Keywords
org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl
org.springframework.boot.loader.LaunchedURLClassLoader @ 0xb02ba788
java.util.HashMap$Node[]

(columns: shallow heap / retained heap, in bytes)

java.util.HashMap$Node[256] @ 0xb6cfca48   1 040   65 119 184
\table java.util.HashMap @ 0xb1bc46c8   48   65 119 248
.\map java.util.HashSet @ 0xb1bc46b8   16   65 119 264
..\factories org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl @ 0xb1bc4620   136   65 125 032
...+serverLocator org.apache.activemq.artemis.core.server.federation.FederationConnection @ 0xb1bc45f0   48   48
...|\connection org.apache.activemq.artemis.core.server.federation.FederationUpstream @ 0xb1bc4530   48   480
...|.\upstream org.apache.activemq.artemis.core.server.federation.queue.FederatedQueue @ 0xb1bc43a0   56   7 392
...|..\[2] java.lang.Object[4] @ 0xb16fe5b8   32   48
...|...\array java.util.concurrent.CopyOnWriteArrayList @ 0xb16fe590   24   88
...|....\brokerPlugins org.apache.activemq.artemis.core.config.impl.FileConfiguration @ 0xb16f7460   592   7 272
...|.....\configuration org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl @ 0xb16f7208   280   4 168
...|......+server org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl @ 0xb3387f60   96   1 952
...|......|\this$0 org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread @ 0xb33ce5f8 activemq-failure-check-thread Thread   128   336
...|......+server org.apache.activemq.artemis.core.server.impl.ServerStatus @ 0xb1bc8d38 >   24   1 184
...|......\Total: 2 entries
...+serverLocator org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xd090a7c8 >

(columns: shallow heap / retained heap / percentage of heap)

org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl @ 0xb1bc4620   136   65 125 032   34,91%
\java.util.HashSet @ 0xb1bc46b8   16   65 119 264   34,90%
.\java.util.HashMap @ 0xb1bc46c8   48   65 119 248   34,90%
..\java.util.HashMap$Node[256] @ 0xb6cfca48   1 040   65 119 184   34,90%
...+java.util.HashMap$Node @ 0xb59106b8   32   236 464   0,13%
...+java.util.HashMap$Node @ 0xb620b438   32   181 944   0,10%
...+java.util.HashMap$Node @ 0xb3e5c6a0   32   169 344   0,09%
...+java.util.HashMap$Node @ 0xb3bb0cd8   32   147 120   0,08%
...+java.util.HashMap$Node @ 0xb5169f88   32   139 184   0,07%
...+java.util.HashMap$Node @ 0xb4456400   32   131 296   0,07%
...+java.util.HashMap$Node @ 0xb59e6668   32   126 616   0,07%
...+java.util.HashMap$Node @ 0xb58818a0   32   121 504   0,07%
...+java.util.HashMap$Node @ 0xb5839f40   32   120 560   0,06%
...+java.util.HashMap$Node @ 0xb84fd158   32   120 496   0,06%
...+java.util.HashMap$Node @ 0xb58104a8   32   118 000   0,06%
...+java.util.HashMap$Node @ 0xb5947bd8   32   115 920   0,06%
...+java.util.HashMap$Node @ 0xb84e3be8   32   115 024   0,06%
...+java.util.HashMap$Node @ 0xb8488828   32   113 400   0,06%
...+java.util.HashMap$Node @ 0xb3bc5258   32   110 704   0,06%
...+java.util.HashMap$Node @ 0xb3e2cdc8   32   110 208   0,06%
...+java.util.HashMap$Node @ 0xb57bda48   32   109 688   0,06%
...+java.util.HashMap$Node @ 0xb57d2880   32   109 232   0,06%
...+org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xb62518c8   192   106 944   0,06%
...+org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xd0b00a40   192   98 336   0,05%
...\Total: 20 entries   960   2 601 984   1,39%

When the federation configuration is removed on the affected node, memory consumption on that node goes back to normal.

Thanks,
Michal Balicki
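PS: For context, the manual workaround mentioned above (closing the suspect connection from the console) could presumably also be scripted against the embedded broker. The class below is only a hypothetical sketch - the name, the server reference and the 500-session threshold are made up for illustration and we have not run it in production; it simply walks the broker's connections and closes any that hold an unreasonable number of sessions via the standard closeConnectionWithID management operation:

import org.apache.activemq.artemis.core.server.ActiveMQServer;
import org.apache.activemq.artemis.spi.core.protocol.RemotingConnection;

// Hypothetical helper, illustration only: automates the "close the connection
// from the console" workaround for connections that pile up sessions.
public class StaleFederationConnectionReaper {

    private static final int SESSION_THRESHOLD = 500; // illustrative value

    private final ActiveMQServer server; // the embedded broker instance

    public StaleFederationConnectionReaper(ActiveMQServer server) {
        this.server = server;
    }

    public void reap() throws Exception {
        for (RemotingConnection connection : server.getRemotingService().getConnections()) {
            String connectionId = connection.getID().toString();
            // Count the sessions currently attached to this connection.
            int sessionCount = server.getSessions(connectionId).size();
            if (sessionCount > SESSION_THRESHOLD) {
                // Close the connection through the management API,
                // as we currently do by hand from the console.
                server.getActiveMQServerControl().closeConnectionWithID(connectionId);
            }
        }
    }
}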