Hi Team,
I would like to update this question. The problem is still unresolved, and I cannot figure out how to reproduce it in an isolated environment. Are there any ideas about when this error could happen on the backup server?

AMQ219014: Timed out after waiting 30000 ms for response when sending packet 125

Packet 125 is CLUSTER_CONNECT (https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/protocol/core/impl/PacketImpl.java).

I am still unable to find out which option controls this timeout on the backup broker. I have specified a 120000 ms value everywhere I could find: call-timeout on each cluster connector URL and on the default acceptor, initial-replication-sync-timeout (in ha-policy/replication), and connection-ttl / call-timeout (in the cluster-connection). I still get the 30000 ms timeout.

--
Best regards,
Aleksandr

From: MILOVIDOV Aleksandr
Sent: Friday, October 20, 2023 1:17 AM
To: 'users@activemq.apache.org' <users@activemq.apache.org>
Subject: RE: Problems with Artemis primary/backup replication

Hi Team,

Are there any ideas about what is wrong with replication? Any suggestions on tuning the timeout parameters? Which parameter should I configure to increase the timeout that appears in this log message?

WARN [org.apache.activemq.artemis.core.server] AMQ222137: Unable to announce backup, retrying
org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: AMQ219014: Timed out after waiting 30000 ms for response when sending packet 125
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:550) ~[artemis-core-client-2.31.0.jar:2.31.0]
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:452) ~[artemis-core-client-2.31.0.jar:2.31.0]
    at org.apache.activemq.artemis.core.server.cluster.ClusterControl.authorize(ClusterControl.java:80) ~[artemis-server-2.31.0.jar:2.31.0]
    at org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:277) ~[artemis-server-2.31.0.jar:2.31.0]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:57) ~[artemis-commons-2.31.0.jar:?]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:32) ~[artemis-commons-2.31.0.jar:?]
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:68) ~[artemis-commons-2.31.0.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.31.0.jar:?]

I have increased initial-replication-sync-timeout to 120000 and it did not help. I have also increased call-timeout and connection-ttl on the cluster connections to 120000, but call-timeout does not seem to be applied on the backup server (the logs still show the default value of 30000).

--
Best regards,
Aleksandr

From: MILOVIDOV Aleksandr
Sent: Tuesday, October 17, 2023 6:39 PM
To: users@activemq.apache.org
Subject: Problems with Artemis primary/backup replication

Hi All,

We have problems with replication on some primary/backup nodes of an Artemis cluster. The cluster has 3 primary and 3 backup nodes. The Artemis version is 2.30.0. It looks like replication was more stable in version 2.28.0, but I'm not sure. The replication problem often occurs after a restart of the servers.
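For context, each primary/backup pair uses replicated HA with the ZooKeeper-based (pluggable quorum) lock manager. The ha-policy sections look roughly like the trimmed sketch below; the ZooKeeper connect string, the allow-failback value and the timeout value shown here are illustrative placeholders rather than our exact settings:

<!-- primary broker.xml (trimmed sketch) -->
<ha-policy>
   <replication>
      <primary>
         <manager>
            <class-name>org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedPrimitiveManager</class-name>
            <properties>
               <!-- illustrative ZooKeeper ensemble -->
               <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
         <!-- 30000 ms is the default -->
         <initial-replication-sync-timeout>30000</initial-replication-sync-timeout>
      </primary>
   </replication>
</ha-policy>

<!-- backup broker.xml (trimmed sketch) -->
<ha-policy>
   <replication>
      <backup>
         <manager>
            <class-name>org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedPrimitiveManager</class-name>
            <properties>
               <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
         <!-- illustrative; the backup does fail back once the primary is synchronized -->
         <allow-failback>true</allow-failback>
      </backup>
   </replication>
</ha-policy>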
For example, we perform a sequential restart of the primary and backup server with a 6-minute interval between restarts. When the primary server is restarted, its live role is transferred to the backup server. After the primary server comes back up and has been replicated, the live role is transferred back to it. Sometimes the backup server fails to replicate and does not try to re-establish replication. Usually it writes a log message like:

AMQ219014: Timed out after waiting 30000 ms for response when sending packet 125

When this occurs, the primary server logs these events:

AMQ224014: Failed to close session java.lang.NullPointerException: null
AMQ222013: Error when trying to start replication
AMQ229114: Replication synchronization process timed out after waiting 30000 milliseconds

When I restart only the backup server, nothing happens and it does not retry replication with the live server. If I stop the backup server, remove the data/* folders and restart it, it begins replication. While replication is running, some clients report connection problems to the primary server, perhaps because of the heavy replication traffic. Sometimes this helps, but usually it ends with the same error. I don't know whether it is a bug or whether we have hit some performance or size limits. In any case, we plan to split this cluster into several smaller clusters.

The servers have 8 x86 processors and 20 GB of RAM. The Java heap size is 12 GB (Xms equals Xmx). Address memory usage is about 150-250 MB. The current size of the data directory on each server is about 6-7 GB; 98% of it is the journal, and the other folders are relatively small (bindings - 20 MB, paging - 9 MB, large messages - 86 MB).

I tried to find out which timeout has the default value of 30000 and increased call-timeout on the cluster connections to 60000 ms. After this change, the error on the backup server has not changed (the timeout value is still 30000). On the primary server, the timeout in the logs has changed to 60000. Other cluster connection settings:

<check-period>5000</check-period>
<connection-ttl>120000</connection-ttl>
<retry-interval>500</retry-interval>
<use-duplicate-detection>true</use-duplicate-detection>
<message-load-balancing>ON_DEMAND</message-load-balancing>
<producer-window-size>-1</producer-window-size>

I have found an article in the Red Hat knowledge base (https://access.redhat.com/solutions/6190581) with a similar error description, related to JBoss EAP 7:

"Apache Artemis CORE protocol expects some packets to be acknowledged within 30 seconds (default). If for some reason the acknowledge packet does not arrive within that time, the above message AMQ219014 is logged. It is possible to increase the default (30 seconds) call-timeout, but it is not recommended since in most cases it will not fix the problem, only mask it."

The possible reasons given are high CPU usage and long-running garbage collection. After we increased the number of processors, CPU usage on these servers seems normal (about 40%). The garbage collection metric jvm_gc_collection_seconds_sum is relatively small (0.1% of broker uptime). I also tried restarting the servers at night, when the server load is low, and the error was the same.
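For completeness, the full cluster-connection definition currently looks roughly like the sketch below. The timeout values are the ones quoted above; the cluster name, address, connector names and max-hops are illustrative, since the real definition uses our own naming:

<cluster-connections>
   <cluster-connection name="my-cluster">
      <!-- empty address = match all addresses (illustrative) -->
      <address></address>
      <connector-ref>artemis-node1</connector-ref>
      <check-period>5000</check-period>
      <connection-ttl>120000</connection-ttl>
      <!-- raised from the 30000 ms default; the primary picks it up, the backup apparently does not -->
      <call-timeout>60000</call-timeout>
      <retry-interval>500</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <producer-window-size>-1</producer-window-size>
      <static-connectors>
         <connector-ref>artemis-node2</connector-ref>
         <connector-ref>artemis-node3</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>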
Logs of the primary server after the replication error (after a number of "Replication: sending ..." file events):

...
2023-10-14 00:40:49,711 INFO [org.apache.activemq.artemis.core.server] AMQ221025: Replication: sending NIOSequentialFile /var/amq/data/paging/52cd999e-6975-11ee-b293-005056b9aa5c/000000004.page (size=0) to replica.
2023-10-14 00:40:50,891 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 2d22af15-6a11-11ee-8c13-76926571cbfc
2023-10-14 00:40:50,891 ERROR [org.apache.activemq.artemis.core.server] AMQ224014: Failed to close session
java.lang.NullPointerException: null
    at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.doClose(ServerSessionImpl.java:457) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl$1.done(ServerSessionImpl.java:1730) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl.executeOnCompletion(OperationContextImpl.java:226) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl.executeOnCompletion(OperationContextImpl.java:168) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.close(ServerSessionImpl.java:1722) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.close(ServerSessionImpl.java:1710) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.connectionFailed(ServerSessionImpl.java:2147) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.connectionFailed(ServerSessionImpl.java:2157) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:98) ~[artemis-core-client-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:212) ~[artemis-core-client-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:232) ~[artemis-core-client-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl.issueFailure(RemotingServiceImpl.java:617) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl.connectionDestroyed(RemotingServiceImpl.java:598) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.remoting.impl.netty.NettyAcceptor$Listener.connectionDestroyed(NettyAcceptor.java:974) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.remoting.impl.netty.ActiveMQChannelHandler.lambda$channelInactive$0(ActiveMQChannelHandler.java:89) ~[artemis-core-client-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:57) ~[artemis-commons-2.30.0.jar:?]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:32) ~[artemis-commons-2.30.0.jar:?]
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:68) ~[artemis-commons-2.30.0.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.30.0.jar:?]
2023-10-14 00:40:50,892 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 2d22af15-6a11-11ee-8c13-76926571cbfc
2023-10-14 00:41:00,536 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 2351b660-6a11-11ee-8e87-9a9aaf54056d
... (skipped several repeated messages: AMQ224014: Failed to close session java.lang.NullPointerException: null) ...
2023-10-14 00:41:59,720 WARN [org.apache.activemq.artemis.core.server] AMQ222013: Error when trying to start replication
org.apache.activemq.artemis.api.core.ActiveMQReplicationTimeooutException: AMQ229114: Replication synchronization process timed out after waiting 60000 milliseconds
    at org.apache.activemq.artemis.core.replication.ReplicationManager.sendSynchronizationDone(ReplicationManager.java:829) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:677) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.impl.ReplicationPrimaryActivation.replicate(ReplicationPrimaryActivation.java:244) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.impl.ReplicationPrimaryActivation.lambda$startAsyncReplication$1(ReplicationPrimaryActivation.java:232) ~[artemis-server-2.30.0.jar:2.30.0]
    at java.lang.Thread.run(Thread.java:829) [?:?]

Backup server logs:

2023-10-14 00:40:32,251 INFO [org.apache.activemq.artemis.core.server.impl.ReplicationBackupActivation] Apache ActiveMQ Artemis Backup Server version 2.30.0 [ActiveMQServerImpl::name=amq-prod1] started, awaiting connection to a live cluster member to start replication
2023-10-14 00:40:53,960 INFO [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::name=amq-prod1 is synchronized with live server, nodeID=3e8b6a7b-655d-11eb-a214-005056b9aa5c.
2023-10-14 00:41:23,963 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to servername-amq1.domain.com/10.10.0.1:61616 has been detected: AMQ219014: Timed out after waiting 30000 ms for response when sending packet 125 [code=CONNECTION_TIMEDOUT]
2023-10-14 00:41:23,963 WARN [org.apache.activemq.artemis.core.server] AMQ222137: Unable to announce backup, retrying
org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: AMQ219014: Timed out after waiting 30000 ms for response when sending packet 125
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:550) ~[artemis-core-client-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:452) ~[artemis-core-client-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.cluster.ClusterControl.authorize(ClusterControl.java:80) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:277) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:57) ~[artemis-commons-2.30.0.jar:?]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:32) ~[artemis-commons-2.30.0.jar:?]
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:68) ~[artemis-commons-2.30.0.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.30.0.jar:?]
2023-10-14 00:41:54,471 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to servername-amq1.domain.com/10.10.0.1:61616 has been detected: AMQ219014: Timed out after waiting 30000 ms for response when sending packet 125 [code=CONNECTION_TIMEDOUT]
2023-10-14 00:41:54,471 WARN [org.apache.activemq.artemis.core.server] AMQ222137: Unable to announce backup, retrying
org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: AMQ219014: Timed out after waiting 30000 ms for response when sending packet 125
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:550) ~[artemis-core-client-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:452) ~[artemis-core-client-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.cluster.ClusterControl.authorize(ClusterControl.java:80) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:277) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:57) ~[artemis-commons-2.30.0.jar:?]
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:32) ~[artemis-commons-2.30.0.jar:?]
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:68) ~[artemis-commons-2.30.0.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.30.0.jar:?]
2023-10-14 00:41:59,830 INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced

When stopping a backup server that is not replicated:

2023-10-14 01:35:55,472 INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
2023-10-14 01:47:54,495 WARN [org.apache.activemq.artemis.core.server.impl.ReplicationBackupActivation] Errored while attempting failover
java.lang.InterruptedException: null
    at java.lang.Object.wait(Native Method) ~[?:?]
    at java.lang.Object.wait(Object.java:328) ~[?:?]
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1594) ~[zookeeper-3.6.3.jar:3.6.3]
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1566) ~[zookeeper-3.6.3.jar:3.6.3]
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:1999) ~[zookeeper-3.6.3.jar:3.6.3]
    at org.apache.curator.framework.imps.DeleteBuilderImpl$6.call(DeleteBuilderImpl.java:313) ~[curator-framework-5.2.0.jar:5.2.0]
    at org.apache.curator.framework.imps.DeleteBuilderImpl$6.call(DeleteBuilderImpl.java:301) ~[curator-framework-5.2.0.jar:5.2.0]
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93) ~[curator-client-5.2.0.jar:?]
    at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:298) ~[curator-framework-5.2.0.jar:5.2.0]
    at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:282) ~[curator-framework-5.2.0.jar:5.2.0]
    at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35) ~[curator-framework-5.2.0.jar:5.2.0]
    at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:347) ~[curator-recipes-5.2.0.jar:5.2.0]
    at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:124) ~[curator-recipes-5.2.0.jar:5.2.0]
    at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154) ~[curator-recipes-5.2.0.jar:5.2.0]
    at org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2.internalAcquire1Lease(InterProcessSemaphoreV2.java:435) ~[curator-recipes-5.2.0.jar:5.2.0]
    at org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2.acquire(InterProcessSemaphoreV2.java:284) ~[curator-recipes-5.2.0.jar:5.2.0]
    at org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2.acquire(InterProcessSemaphoreV2.java:246) ~[curator-recipes-5.2.0.jar:5.2.0]
    at org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedLock.lambda$tryLock$1(CuratorDistributedLock.java:102) ~[artemis-quorum-ri-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedPrimitive.tryRun(CuratorDistributedPrimitive.java:140) ~[artemis-quorum-ri-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedLock.tryLock(CuratorDistributedLock.java:94) ~[artemis-quorum-ri-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.quorum.DistributedLock.tryLock(DistributedLock.java:56) ~[artemis-quorum-api-2.30.0.jar:?]
    at org.apache.activemq.artemis.core.server.impl.quorum.ActivationSequenceStateMachine.tryActivate(ActivationSequenceStateMachine.java:94) ~[artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.impl.ReplicationBackupActivation.replicateAndFailover(ReplicationBackupActivation.java:333) [artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.impl.ReplicationBackupActivation.run(ReplicationBackupActivation.java:216) [artemis-server-2.30.0.jar:2.30.0]
    at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:4509) [artemis-server-2.30.0.jar:2.30.0]
2023-10-14 01:47:54,496 INFO [org.apache.activemq.artemis.core.server] AMQ221070: Restarting as backup based on quorum vote results.
2023-10-14 01:47:54,515 ERROR [org.apache.curator.framework.imps.FindAndDeleteProtectedNodeInBackground] Could not start guaranteed delete for node: /_c_368d99c8-e7e7-4fdf-b931-d3cce0767ef4-lease-0001227099
2023-10-14 01:47:54,611 INFO [org.apache.activemq.artemis] AMQ241005: Stopping embedded web server
2023-10-14 01:47:54,620 INFO [io.hawt.web.auth.AuthenticationFilter] Destroying hawtio authentication filter
2023-10-14 01:47:54,623 INFO [io.hawt.HawtioContextListener] Destroying hawtio services
2023-10-14 01:47:54,632 INFO [org.apache.activemq.hawtio.plugin.PluginContextListener] Destroyed artemis-plugin plugin
2023-10-14 01:47:54,636 INFO [org.apache.activemq.hawtio.branding.PluginContextListener] Destroyed activemq-branding plugin
2023-10-14 01:47:54,653 INFO [org.apache.activemq.artemis] AMQ241006: Stopped embedded web server
2023-10-14 01:47:54,654 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.30.0 [3e8b6a7b-655d-11eb-a214-005056b9aa5c] stopped, uptime 13 minutes

--
Best regards,
Aleksandr