Hello Artemis community!
We are trying to build a solution using Artemis within Kubernetes. Our requirement is that if a broker goes offline for any reason (a network issue, or the instance shutting down either gracefully or suddenly), clients can still produce and consume messages as usual.

To achieve that, we tried a High Availability setup with primary + backup brokers (we tested both replication and shared storage). Each broker is a StatefulSet, and a Kubernetes Service provides connectivity to the brokers. To connect to the primary and the backup we use headless Services (https://kubernetes.io/docs/concepts/services-networking/service/#headless-services). For example, artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local is the headless Service address used to reach Pod artemis-ha-primary-0 through Service svc-artemis-ha-primary. We tested on both Artemis v2.38.0 and Artemis v2.42.0.

We tested the solution by:

1. Producing messages with the artemis producer command:

./bin/artemis producer --url "(tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616,tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616)?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1" --destination queue://... --user ... --password .... --message-count 10000 --sleep 2 --verbose

2. Crashing the primary instance (by deleting the pod, or by scaling down the StatefulSet that runs Artemis).

We expected the client to be able to recover. However, we were not able to see that.
Analyzing the broker logs and console, there is connectivity between the primary and the backup:

(primary)
INFO  [org.apache.activemq.artemis.core.server] AMQ221035: Primary Server Obtained primary lock

(backup)
INFO  [org.apache.activemq.artemis.core.server] AMQ221033: ** got backup lock
INFO  [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Connected with the currentConnectorConfig=TransportConfiguration(name=primary, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory)?port=61616&host=svc-artemis-ha-primary-svc-cluster-local

But when we crash the primary broker, the backup instance enters a loop of:

ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to create netty connection
java.net.UnknownHostException: svc-artemis-ha-primary.svc.cluster.local

and never promotes itself to a live broker.

We were able to make it work only when stopping the primary broker gracefully (artemis stop):

INFO  [org.apache.activemq.artemis.core.client] AMQ214036: Connection closure to svc-artemis-ha-primary.svc.cluster.local/....:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
INFO  [org.apache.activemq.artemis.core.client] AMQ214036: Connection closure to svc-artemis-ha-primary.svc.cluster.local/...:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
INFO  [org.apache.activemq.artemis.core.server] AMQ221010: Backup Server is now active

We'd like to ask whether what we are trying to do is possible in the context of Kubernetes, whether we have stumbled upon a bug or limitation, or whether we did something wrong. Thank you very much in advance for the help. I'm available for additional questions if needed.
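One observation that may be relevant: the UnknownHostException suggests the backup is failing at DNS resolution, not at the TCP connect. With a headless Service, the per-pod DNS record disappears as soon as the pod is deleted, so reconnect attempts fail with an unknown host rather than a connection refusal. A minimal sketch of that behaviour (the hostname is just our example name; any name with no DNS record behaves the same):

```python
import socket

# Hypothetical hostname; once the pod is deleted, the headless-Service
# record for it no longer exists, so resolution itself fails.
host = "artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local"

try:
    socket.getaddrinfo(host, 61616)
    print("resolved")
except socket.gaierror as err:
    # Python's analogue of java.net.UnknownHostException
    print("resolution failed:", err)
```

A host that exists but refuses connections would resolve fine and fail later, at connect time; we wonder if that difference matters to the broker's failure detection.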
Best regards,
Pedro Teixeira

PS: For reference, our broker configurations are as follows:

-- Primary

<connectors>
   <connector name="self">tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616</connector>
   <connector name="backup">tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616</connector>
</connectors>

<ha-policy>
   <replication>
      <primary>
      </primary>
   </replication>
</ha-policy>

<cluster-user>ACTIVEMQ.CLUSTER.ADMIN.USER</cluster-user>
<cluster-password>....</cluster-password>
<cluster-connections>
   <cluster-connection name="artemis-cluster">
      <!-- All addresses -->
      <address></address>
      <connector-ref>self</connector-ref>
      <retry-interval>500</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <message-load-balancing>OFF</message-load-balancing>
      <max-hops>1</max-hops>
      <static-connectors>
         <connector-ref>backup</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>

-- Backup

<ha-policy>
   <replication>
      <backup>
         <!-- Ensure the backup that has become active restarts as a backup
              after failback instead of stopping; otherwise the pod enters
              CrashLoopBackOff. See
              https://activemq.apache.org/components/artemis/documentation/latest/ha.html#failback-with-shared-store -->
         <allow-failback>true</allow-failback>
         <restart-backup>true</restart-backup>
      </backup>
   </replication>
</ha-policy>

<!-- Configure the cluster connection -->
<cluster-user>ACTIVEMQ.CLUSTER.ADMIN.USER</cluster-user>
<cluster-password>...</cluster-password>
<cluster-connections>
   <cluster-connection name="artemis-cluster">
      <!-- All addresses -->
      <address></address>
      <connector-ref>self</connector-ref>
      <retry-interval>500</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <!-- The goal is a primary/backup solution, not load balancing -->
      <message-load-balancing>OFF</message-load-balancing>
      <max-hops>1</max-hops>
      <!-- Load balancing only to other Artemis servers directly connected to this one -->
      <static-connectors>
         <connector-ref>primary</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
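In case it helps the discussion: with a single primary/backup pair using replication, we understand from the HA chapter of the Artemis documentation that the brokers may need an external coordinator (the pluggable quorum manager, e.g. backed by ZooKeeper) to safely decide on activation after a sudden primary failure. Below is a sketch of what we believe the primary-side config would look like; the connect-string value is a placeholder, and the exact element names and properties should be double-checked against the documentation for the Artemis version in use:

```xml
<ha-policy>
   <replication>
      <primary>
         <!-- Pluggable lock/quorum manager; defaults to the ZooKeeper
              implementation per our reading of the docs. -->
         <manager>
            <properties>
               <!-- Placeholder ZooKeeper ensemble address -->
               <property key="connect-string" value="zk-0:2181,zk-1:2181,zk-2:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>
```

We have not tested this ourselves yet; we mention it only because the classic replication quorum seems hard to satisfy with only two brokers.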