Hello Artemis community!
We are trying to build a solution using Artemis within Kubernetes. Our requirement is that if a broker goes offline for any reason (a network issue, or the instance shutting down either gracefully or suddenly), clients can still produce and consume messages as usual.

To achieve that, we tried a High Availability setup with primary + backup brokers (we tested both replication and shared storage). Each broker is a StatefulSet, and a Kubernetes Service provides connectivity to the brokers. To connect to the primary and the backup we use headless Services (https://kubernetes.io/docs/concepts/services-networking/service/#headless-services). For example, artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local is the headless Service address used to reach Pod artemis-ha-primary-0 through Service svc-artemis-ha-primary. We tested on both Artemis v2.38.0 and Artemis v2.42.0.

We tested the solution by:

1. Producing messages with the artemis producer command:

./bin/artemis producer --url "(tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616,tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616)?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1" --destination queue://... --user ... --password .... --message-count 10000 --sleep 2 --verbose

2. Crashing the primary instance (by deleting the pod, or by scaling down the StatefulSet that runs Artemis).

We expected the client to be able to recover. However, we were not able to see that.
Analyzing the broker logs and console, there is connectivity between the primary and the backup:

(primary)
INFO  [org.apache.activemq.artemis.core.server] AMQ221035: Primary Server Obtained primary lock

(backup)
INFO  [org.apache.activemq.artemis.core.server] AMQ221033: ** got backup lock
INFO  [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Connected with the currentConnectorConfig=TransportConfiguration(name=primary, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory)?port=61616&host=svc-artemis-ha-primary-svc-cluster-local

But when we crash the primary broker, the backup instance enters a loop of:

ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to create netty connection
java.net.UnknownHostException: svc-artemis-ha-primary.svc.cluster.local

and never promotes itself to a live broker.

We were able to make it work only when stopping the primary broker gracefully (artemis stop):

INFO  [org.apache.activemq.artemis.core.client] AMQ214036: Connection closure to svc-artemis-ha-primary.svc.cluster.local/....:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
INFO  [org.apache.activemq.artemis.core.client] AMQ214036: Connection closure to svc-artemis-ha-primary.svc.cluster.local/...:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
INFO  [org.apache.activemq.artemis.core.server] AMQ221010: Backup Server is now active

We'd like to ask whether what we are trying to do is possible in the context of Kubernetes, whether we have stumbled upon a bug or limitation, or whether we did something wrong. Thank you very much in advance for the help. I'm available for additional questions if needed.
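One observation that may be relevant: the UnknownHostException suggests the backup is failing at DNS resolution, not at the TCP connect. With a headless Service, the per-pod DNS record disappears as soon as the pod is deleted, so reconnect attempts fail with an unknown host rather than a connection refusal. A minimal sketch of that behaviour (the hostname is just our example name; any name with no DNS record behaves the same):

```python
import socket

# Hypothetical hostname; once the pod is deleted, the headless-Service
# record for it no longer exists, so resolution itself fails.
host = "artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local"

try:
    socket.getaddrinfo(host, 61616)
    print("resolved")
except socket.gaierror as err:
    # Python's analogue of java.net.UnknownHostException
    print("resolution failed:", err)
```

A host that exists but refuses connections would resolve fine and fail later, at connect time; we wonder if that difference matters to the broker's failure detection.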
Best regards,
Pedro Teixeira

PS: For reference, our broker configurations are as follows:

-- Primary

<connectors>
   <connector name="self">tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616</connector>
   <connector name="backup">tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616</connector>
</connectors>

<ha-policy>
   <replication>
      <primary>
      </primary>
   </replication>
</ha-policy>

<cluster-user>ACTIVEMQ.CLUSTER.ADMIN.USER</cluster-user>
<cluster-password>....</cluster-password>
<cluster-connections>
   <cluster-connection name="artemis-cluster">
      <!-- All addresses -->
      <address></address>
      <connector-ref>self</connector-ref>
      <retry-interval>500</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <message-load-balancing>OFF</message-load-balancing>
      <max-hops>1</max-hops>
      <static-connectors>
         <connector-ref>backup</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>

-- Backup

<ha-policy>
   <replication>
      <backup>
         <!-- Ensure the backup that has become active restarts as a backup
              after failback instead of stopping; otherwise the pod enters
              CrashLoopBackOff. See
              https://activemq.apache.org/components/artemis/documentation/latest/ha.html#failback-with-shared-store -->
         <allow-failback>true</allow-failback>
         <restart-backup>true</restart-backup>
      </backup>
   </replication>
</ha-policy>

<!-- Configure the cluster connection -->
<cluster-user>ACTIVEMQ.CLUSTER.ADMIN.USER</cluster-user>
<cluster-password>...</cluster-password>
<cluster-connections>
   <cluster-connection name="artemis-cluster">
      <!-- All addresses -->
      <address></address>
      <connector-ref>self</connector-ref>
      <retry-interval>500</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <!-- The goal is a primary/backup solution, not load balancing -->
      <message-load-balancing>OFF</message-load-balancing>
      <max-hops>1</max-hops>
      <!-- Load balancing only to other Artemis servers directly connected to this one -->
      <static-connectors>
         <connector-ref>primary</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
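In case it helps the discussion: with a single primary/backup pair using replication, we understand from the HA chapter of the Artemis documentation that the brokers may need an external coordinator (the pluggable quorum manager, e.g. backed by ZooKeeper) to safely decide on activation after a sudden primary failure. Below is a sketch of what we believe the primary-side config would look like; the connect-string value is a placeholder, and the exact element names and properties should be double-checked against the documentation for the Artemis version in use:

```xml
<ha-policy>
   <replication>
      <primary>
         <!-- Pluggable lock/quorum manager; defaults to the ZooKeeper
              implementation per our reading of the docs. -->
         <manager>
            <properties>
               <!-- Placeholder ZooKeeper ensemble address -->
               <property key="connect-string" value="zk-0:2181,zk-1:2181,zk-2:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>
```

We have not tested this ourselves yet; we mention it only because the classic replication quorum seems hard to satisfy with only two brokers.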