I was able to take the archive you attached and reproduce the issue in just
a few minutes. Thanks for the great reproducer!

During reproduction I noticed something odd in the log. In a two-node
cluster you would expect each node to have a single bridge (i.e. to the
*other* node of the cluster). However, after killing and restarting node 1,
each node actually had more than one bridge. After looking at your
configuration more closely I saw that you had disabled persistence (i.e.
using <persistence-enabled>false</persistence-enabled>). This has a
specific impact on a clustered configuration because when a node starts
with an empty journal it generates a unique node ID and persists it to
disk. This ID is what identifies the node in the cluster so that everybody
in the cluster "knows" who everybody else is. When a node is restarted for
any reason the other nodes in the cluster are able to recognize it as the
same node based on the ID.

However, when you disable persistence you disable the persistent ID, so
every time a node restarts it is seen as a "new" node in the cluster. Given
that you're using the default reconnect-attempts of -1 (i.e. infinite) on
your cluster-connection, every time you restart a node all the other nodes
in the cluster will keep trying to reconnect to this never-to-return node
forever. Furthermore, they'll be trying to reconnect every 500
milliseconds. This reconnection thrashing appears to be causing the
ordering issue, because as soon as I enabled persistence I could no longer
reproduce the problem. I also tried leaving persistence disabled and
setting reconnect-attempts to 0 instead, and that appears to solve the
problem as well.

I don't yet know *why* the reconnection thrashing causes the problem, but I
believe you can effectively work around the issue either by enabling
persistence, by disabling reconnection, or at least by setting
reconnect-attempts to a low value and increasing the retry-interval (e.g.
using 5 and 10000 respectively).
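In broker.xml terms the workarounds would look roughly like this (just a
sketch; the cluster name and connector names below are placeholders, not
your actual configuration):

```xml
<!-- Option 1: re-enable persistence so the node ID survives restarts -->
<persistence-enabled>true</persistence-enabled>

<!-- Option 2: if persistence must stay off, tame the reconnect loop -->
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>netty-connector</connector-ref>
      <!-- give up after a few attempts instead of the default -1 (infinite) -->
      <reconnect-attempts>5</reconnect-attempts>
      <!-- wait longer between attempts (default is 500 ms) -->
      <retry-interval>10000</retry-interval>
      <static-connectors>
         <connector-ref>other-node-connector</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
```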

Hope that helps!


Justin

On Thu, Mar 16, 2023 at 7:59 AM Oliver Lins <l...@lins-it.de> wrote:

> Hi,
>
> I've attached an archive containing the test apps, logs and a readme file.
>
> If you have any questions pls let me know.
>
> Thank you,
> Oliver
>
> On 3/15/23 16:31, Justin Bertram wrote:
> > I just need a way to reproduce what you're seeing so once you get your
> > reproducer in order let me know. Thanks!
> >
> >
> > Justin
> >
> > On Wed, Mar 15, 2023 at 9:36 AM Oliver Lins <l...@lins-it.de> wrote:
> >
> >> Hi Justin,
> >>
> >> thank you for your fast reply.
> >>
> >>   > Would it be possible for you to work up a way to reproduce the
> >> behavior you're seeing?
> >> Yes, I can reproduce the behavior. I have simplified producer and
> >> consumer Java code to reproduce.
> >> The code is not yet the bare minimum necessary to work, but I can change
> >> that.
> >>
> >>   >  If so, is the order-of-creation only essential per producer [...]
> >> Yes, the order is only essential per producer.
> >>
> >> Please let me know how I can assist you.
> >>
> >> Thank you,
> >> Oliver
> >>
> >> On 3/15/23 14:58, Justin Bertram wrote:
> >>> Based on your description, attached configuration, and logs I don't see
> >>> anything wrong, per se. Would it be possible for you to work up a way
> >>> to reproduce the behavior you're seeing?
> >>>
> >>> Do you ever have more than 1 producer? If so, is the order-of-creation
> >>> only essential per producer or is it essential across all producers?
> >>>
> >>>
> >>> Justin
> >>>
> >>> On Wed, Mar 15, 2023 at 8:29 AM Oliver Lins <l...@lins-it.de> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> we are using Artemis with the following setup:
> >>>> - 2 independent broker instances (on 2 hosts)
> >>>> - a cluster configuration to create a Core bridge between both
> >>>> instances (no failover, no HA)
> >>>> - multiple JMS clients produce and consume AMQP messages using topics
> >>>> - the clients do a failover themselves
> >>>> - Artemis versions (2.21.0, 2.29.0-SNAPSHOT cloned on 08.03)
> >>>>
> >>>> Everything is working fine. Regardless of which Artemis instance the
> >>>> producers or consumers are connected to, they receive all messages in
> >>>> the order of creation.
> >>>>
> >>>> To simulate a server failure we kill (-9) Artemis instance 1 and
> >>>> restart the instance again (~ half a minute later).
> >>>> - 1 producer connects to the restarted instance 1
> >>>> - multiple consumers are (still) connected to instance 2
> >>>> - 1 consumer connects to the restarted instance 1
> >>>>
> >>>> The producer sends messages with a delay of 1 ms.
> >>>> Now we see that
> >>>> - the order of messages received by the consumer connected to
> >>>> instance 1 frequently does not match the order the messages are created
> >>>> - the order of messages received by consumers connected to instance 2
> >>>> matches the order the messages are created
> >>>>
> >>>> It is essential for us that the messages arrive in the order of
> >>>> creation.
> >>>> Do you have any ideas about what went wrong or what we are doing wrong?
> >>>>
> >>>> Thanks in advance,
> >>>> Oliver
> >>>>
> >>>> Pls note: the attached files are used to reproduce what we saw in
> >>>> production. This test configuration uses 1 Docker instance per
> >>>> Artemis broker. Both instances are running on the same host using
> >>>> different ports.
> >> --
> >> Dipl.-Ing. FH der technischen Informatik
> >> Tel.: +49 179 2911883
> >> Email: ol...@lins-it.de
> >> Internet:
> >>          http://www.lins-it.de
> >>
> >>
>
> --
> Dipl.-Ing. FH der technischen Informatik
> Tel.: +49 179 2911883
> Email: ol...@lins-it.de
> Internet:
>         http://www.lins-it.de
>