Hi Pedro,

When the backup broker announces itself, the core clients receive the
connector URL defined in the cluster-connection. After a failover, the core
clients try to connect to the connector URL received from the backup
broker. This works fine if the brokers and the clients are in the same
network. When the brokers are deployed in a Kubernetes cluster and the
clients are external, they fail to connect to the connector URL received
from the backup broker, see ARTEMIS-3640 [1]. You can fix this issue by
setting the connector name in the connection URL of the clients to override
the connector URL received from the backup broker; for further details see
testJMSConsumerAfterFailover [2].
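
For illustration, a variant of your producer test command with named
connectors could look like the following. Please note that the
per-connector "name" URL parameter and the connector names "primary" and
"backup" are assumptions on my side; verify the exact syntax and the
name-matching rules against the test above (the other URL parameters stay
as in your original command):

./bin/artemis producer --url
"(tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616?name=primary,tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616?name=backup)?ha=true&reconnectAttempts=-1"
--destination queue://... --user ... --password ... --message-count 10000
--sleep 2 --verbose

When the names match the connector names announced by the brokers, the
client keeps using the hosts from its own URL instead of the connector URL
received from the backup broker.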

Alternatively, I would suggest taking a look at the leader-follower
solution in the ArkMQ Operator test-suite:
https://github.com/arkmq-org/activemq-artemis-operator/blob/main/controllers/activemqartemis_rwm_pvc_ha_test.go


[1] https://issues.apache.org/jira/browse/ARTEMIS-3640
[2]
https://github.com/apache/activemq-artemis/blob/2.42.0/tests/integration-tests/src/test/java/org/apache/activemq/artemis/tests/integration/cluster/failover/ClientConnectorFailoverTest.java#L313

Regards,
Domenico

On Mon, 29 Sept 2025 at 15:34, Teixeira Pedro (BT-VS/ESW-CSA4)
<[email protected]> wrote:

> Hello Artemis community!
>
>
>
> We are trying to create a solution using Artemis within Kubernetes.
>
> Our requirement is that if a broker is offline for some reason - network
> issue or the instance shuts down either gracefully or suddenly - the
> clients can still produce and consume messages as usual.
>
>
>
> To do that we tried a High Availability solution with Primary + Backup
> brokers (both with replicated and shared storage)
>
> Each broker is a StatefulSet and there's a Kubernetes Service to ensure
> connectivity to the brokers.
>
> To connect to the primary and backup brokers we use Headless Services (
> https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
> ).
>
> For example, artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local is the
> DNS name used to access Pod artemis-ha-primary-0 through the Headless
> Service svc-artemis-ha-primary.
>
>
>
> We tested both on Artemis v2.38.0 and Artemis v2.42.0.
>
>
>
> We tested the solution by
>
>    1. Producing messages using the artemis producer command:
>    ./bin/artemis producer --url
>    "(tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616,tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616)?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1"
>    --destination queue://... --user ... --password ... --message-count 10000
>    --sleep 2 --verbose
>
>
>    2. Crashing the Primary instance (by deleting the pod or scaling down
>    the StatefulSet that runs Artemis)
>
> We expected the client to be able to recover.
>
> However, we were not able to see that.
>
>
>
> Analyzing the brokers' logs and console, there is connectivity between the
> primary and the backup:
>
>
>
> (primary)
>
> INFO [org.apache.activemq.artemis.core.server] AMQ221035: Primary Server
> Obtained primary lock
>
>
>
>
>
> (backup)
>
> INFO [org.apache.activemq.artemis.core.server] AMQ221033: ** got backup
> lock
>
> INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
>
>
>
> DEBUG
> [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl]
> Connected with the
> currentConnectorConfig=TransportConfiguration(name=primary,
> factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory)?port=61616&host=svc-artemis-ha-primary-svc-cluster-local
>
>
>
> But when we crash the primary broker we can see that the backup instance
> enters a loop of
>
>
>
> ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to
> create netty connection
>
> java.net.UnknownHostException: svc-artemis-ha-primary.svc.cluster.local
>
>
>
> And never promotes itself to a live broker.
>
>
>
> We were able to make it work only when stopping the primary broker
> gracefully (artemis stop)
>
>
>
> INFO  [org.apache.activemq.artemis.core.client] AMQ214036: Connection closure 
> to svc-artemis-ha-primary.svc.cluster.local/....:61616 has been detected: 
> AMQ219015: The connection was disconnected because of server shutdown 
> [code=DISCONNECTED]
>
> INFO  [org.apache.activemq.artemis.core.client] AMQ214036: Connection closure 
> to svc-artemis-ha-primary.svc.cluster.local/...:61616 has been detected: 
> AMQ219015: The connection was disconnected because of server shutdown 
> [code=DISCONNECTED]
>
> INFO  [org.apache.activemq.artemis.core.server] AMQ221010: Backup Server is 
> now active
>
>
>
> We'd like to ask if what we are trying to do in the context of Kubernetes
> is possible, if we have stumbled upon a bug or limitation or if we did
> something wrong.
>
>
>
> Thank you very much in advance for the help.
>
> I'm available for additional questions if needed.
>
>
>
> Best regards,
>
> *Pedro Teixeira*
>
>
>
> PS: For reference, our broker configurations are as follows:
>
> -- Primary
>
> <connectors>
>
>     <connector name="self">tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616</connector>
>
>     <connector name="backup">tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616</connector>
>
> </connectors>
>
>
>
> <ha-policy>
>
>     <replication>
>
>         <primary>
>
>         </primary>
>
>     </replication>
>
> </ha-policy>
>
>
>
> <cluster-user>ACTIVEMQ.CLUSTER.ADMIN.USER</cluster-user>
>
> <cluster-password>....</cluster-password>
>
> <cluster-connections>
>
>     <cluster-connection name="artemis-cluster">
>
>         <!-- All addresses -->
>
>         <address></address>
>
>         <connector-ref>self</connector-ref>
>
>
>         <retry-interval>500</retry-interval>
>
>         <use-duplicate-detection>true</use-duplicate-detection>
>
>         <message-load-balancing>OFF</message-load-balancing>
>
>         <max-hops>1</max-hops>
>
>         <static-connectors>
>
>             <connector-ref>backup</connector-ref>
>
>         </static-connectors>
>
>     </cluster-connection>
>
> </cluster-connections>
>
>
>
> -- Backup
>
> <ha-policy>
>
>     <replication>
>
>         <backup>
>
>             <!-- ensure the backup that has become active never stops, so
> it's ready to be backup again; otherwise it will automatically stop and
> enter CrashLoopBackOff, per
> https://activemq.apache.org/components/artemis/documentation/latest/ha.html#failback-with-shared-store
> -->
>
>             <allow-failback>true</allow-failback>
>
>             <restart-backup>true</restart-backup>
>
>         </backup>
>
>     </replication>
>
> </ha-policy>
>
>
>
> <!-- Configure the cluster connection -->
>
> <cluster-user>ACTIVEMQ.CLUSTER.ADMIN.USER</cluster-user>
>
> <cluster-password>...</cluster-password>
>
> <cluster-connections>
>
>     <cluster-connection name="artemis-cluster">
>
>         <!-- All addresses -->
>
>         <address></address>
>
>         <connector-ref>self</connector-ref>
>
>
>         <retry-interval>500</retry-interval>
>
>         <use-duplicate-detection>true</use-duplicate-detection>
>
>         <!-- The goal is to have a primary / backup solution, not load
> balancing -->
>
>         <message-load-balancing>OFF</message-load-balancing>
>
>         <max-hops>1</max-hops>
>
>         <!-- load balancing only to other Artemis brokers directly
> connected to this server -->
>
>         <static-connectors>
>
>             <connector-ref>primary</connector-ref>
>
>         </static-connectors>
>
>     </cluster-connection>
>
> </cluster-connections>
>
>
>
>
>
